Distance Matrices
In response to several requests, this page provides the experimental data used in the paper
Haasdonk, B., Bahlmann, C.
Learning with Distance Substitution Kernels.
Pattern Recognition - Proc. of the 26th DAGM Symposium, Tübingen, Germany, August/September 2004. Springer Berlin, 2004.
(.ps, .pdf)
Additional information and larger datasets are occasionally available.
Please note that most of the data was provided by other parties, as
indicated in the corresponding references.
Proteins
- MATLAB data: proteins.mat
- Description:
Matrix of evolutionary distances between
226 proteins. A single distance matrix is given together with four
different binary labelings (one vs. rest), one for each of the four major
classes.
The classes are defined by the first two characters of the protein
codes in the original dataset: class 1 (HA, 72 samples),
class 2 (HB, 72 samples),
class 3 (MY, 39 samples) and
class 4 (GG, GP, 30 samples). The remaining samples
(HZ, HD, HG, HE, HF, 13 samples) are labeled as rest.
- Note: Some references apparently use a different labeling, as the
class "M", which presumably corresponds to our class 3,
is reported to consist of 37 samples there.
- Original data: In Graepel et al. 1999, the original data, including the complete chemical labels (ASCII format), is stated to originate
from M. Vingron and T. Hofmann. As we lack explicit permission,
we do not post this data on this page, but can provide it individually
by email.
- References:
- T. Graepel, R. Herbrich, P. Bollmann-Sdorra and K. Obermayer.
Classification on pairwise proximity data. In Adv. Neural Info. Proc. Syst.,
volume 11, pages 438-444, Cambridge, MA, 1999. MIT Press.
-
T. Graepel, R. Herbrich, B. Schölkopf, A. Smola, P. Bartlett,
K.-R. Müller, K. Obermayer and
B. Williamson. Classification on proximity data
with LP-machines. In Proc. 9th ICANN, pp. 304-309, London, 1999, IEE.
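The one-vs-rest labelings described above can be derived from the protein codes alone. The following is a minimal Python sketch, assuming the codes are available as strings. The example codes, the function names and the +1/-1 label encoding are illustrative assumptions and do not describe the layout of proteins.mat.

```python
# Prefix-to-class mapping as described above; HZ, HD, HG, HE, HF fall into "rest".
CLASS_PREFIXES = {
    1: ("HA",),
    2: ("HB",),
    3: ("MY",),
    4: ("GG", "GP"),
}


def class_of(code):
    """Return the class index (1-4) of a protein code, or 0 for 'rest'."""
    prefix = code[:2].upper()
    for cls, prefixes in CLASS_PREFIXES.items():
        if prefix in prefixes:
            return cls
    return 0


def one_vs_rest_labels(codes):
    """Build the four binary labelings: +1 for the class, -1 for all other samples."""
    classes = [class_of(c) for c in codes]
    return {cls: [1 if c == cls else -1 for c in classes] for cls in CLASS_PREFIXES}


# Hypothetical codes, for illustration only.
codes = ["HAHU", "HBHU", "MYHU", "GGICE", "GPYL", "HZHU"]
labelings = one_vs_rest_labels(codes)
```

Each entry of `labelings` is one binary problem on the same distance matrix; samples from the rest group are negative in all four.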
Cat-Cortex
- MATLAB data: cat-cortex.mat
- Description:
Distance matrix of
connection strengths between 65 regions of the cat's cerebral cortex,
with four binary labelings, one for each of the four functional classes:
class 1 (visual, V, 18 regions), class 2 (auditory, A, 10 regions), class 3
(somatosensory, S, 18 regions) and class 4 (frontolimbic, F, 19 regions).
The distances range from 0 to 4 in steps of 0.5, a result of pairwise
averaging of the original non-symmetric distances.
- Note: Some references apparently use a different labeling, as
classes V and S are reported to contain one more and one fewer sample, respectively, there.
- Original data: The original data (ASCII format) can be provided
on request. So far we have not obtained permission to publish it on the web.
- References:
- T. Graepel, R. Herbrich, P. Bollmann-Sdorra and K. Obermayer.
Classification on pairwise proximity data. In Adv. Neural Info. Proc. Syst.,
volume 11, pages 438-444, Cambridge, MA, 1999. MIT Press.
-
T. Graepel, R. Herbrich, B. Schölkopf, A. Smola, P. Bartlett,
K.-R. Müller, K. Obermayer and
B. Williamson. Classification on proximity data
with LP-machines. In Proc. 9th ICANN, pp. 304-309, London, 1999, IEE.
-
J.W. Scannell, C. Blakemore and M.P. Young.
Analysis of connectivity in the cat cerebral cortex.
Journal of Neuroscience, 15(2):1463-1483, 1995.
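The pairwise averaging mentioned in the description can be sketched as follows. This is a minimal Python illustration on a made-up 3x3 matrix, not the actual cat-cortex data. It assumes the original non-symmetric entries are integers between 0 and 4, which would explain the 0.5 steps in the symmetric matrix.

```python
def symmetrize(d):
    """Return the matrix whose (i, j) entry is the average of d[i][j] and d[j][i]."""
    n = len(d)
    return [[(d[i][j] + d[j][i]) / 2 for j in range(n)] for i in range(n)]


# Hypothetical non-symmetric distances with integer entries in 0..4.
d = [
    [0, 1, 4],
    [2, 0, 3],
    [4, 3, 0],
]
s = symmetrize(d)
```

Here the asymmetric pair (1, 2) averages to 1.5, while pairs that already agree are left unchanged.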
Kimia
- MATLAB data: kimia.mat
- Description:
Symmetric modified Hausdorff distances between binary shape images, as used in
Pekalska et al. 2001. They
produced two 72x72 matrices, each covering 6 classes of 12 samples.
These classes were in turn selected from a larger dataset of 18 classes
from Kimia and coworkers (Sebastian et al. 2001).
- Note: kimia-1 and kimia-2 apparently correspond to the notations B and A
used in one reference.
- Original data:
kimia_orig.tgz
The original data was kindly provided by E. Pekalska, who in turn
refers to B.B. Kimia. It consists of
the image data, the larger set of 18 classes and the non-symmetric
distances on which the symmetric one is based.
- References:
-
E. Pekalska, P. Paclik and R. Duin.
A Generalized Kernel Approach to Dissimilarity Based Classification.
Journal of Machine Learning Research, 2:175-211, 2001.
-
T.B. Sebastian, P.N. Klein and B.B. Kimia.
Recognition of Shapes by Editing Shock Graphs. In Proc. ICCV 2001,
pp. 755-762, 2001.
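For readers unfamiliar with the measure, the following Python sketch shows one common definition of the modified Hausdorff distance between two point sets: a directed (non-symmetric) distance and a symmetric version built from it. Representing shapes as lists of 2-D points and using the maximum for symmetrization are assumptions made for illustration. This page does not state how the symmetric matrices in kimia.mat were derived from the non-symmetric ones.

```python
import math


def directed_mhd(a, b):
    """Mean over the points of a of the distance to the nearest point of b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def symmetric_mhd(a, b):
    """Symmetrize by taking the larger of the two directed distances (assumed)."""
    return max(directed_mhd(a, b), directed_mhd(b, a))


# Two tiny hypothetical point sets, e.g. foreground pixel coordinates.
a = [(0, 0), (1, 0)]
b = [(0, 0), (3, 0)]
```

In this example the directed distance from `a` to `b` is 0.5 and from `b` to `a` is 1.0, which shows why a symmetrization step is needed at all.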
UNIPEN-DTW
- MATLAB data: UNIPEN-DTW.mat
- Description:
For the leave-one-out (LOO) experiments, only a small fraction of the large UNIPEN database (Guyon et al. 1994, UNIPEN-site)
was used, specifically part of the 1c section of the Train-R01/V07
database, which contains lowercase characters. The dissimilarity measure is
the Dynamic Time Warping (DTW) distance as used by Bahlmann et al. 2002.
For each of the two matrices, we randomly drew 50 samples from each of
the 5 classes 'a' to 'e' out of the complete database of 61K samples.
- Original data:
The original data is not free; please refer to the project site
UNIPEN-site.
- References:
-
C. Bahlmann, B. Haasdonk and H. Burkhardt.
On-line Handwriting Recognition with Support Vector
Machines - A Kernel Approach. In Proc. of the 8th IWFHR 2002,
pp. 49-54, 2002.
-
I. Guyon, L. Schomaker, R. Plamondon,
M. Liberman and S. Janet.
UNIPEN project of on-line data exchange and
recognizer benchmarks. In Proc. 12th ICPR,
pp. 29-33, IEEE, 1994.
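As background on the dissimilarity used here, the following is a minimal Python sketch of the textbook Dynamic Time Warping recursion. It assumes one-dimensional sequences and an absolute-difference local cost. The DTW variant of Bahlmann et al. 2002 works on feature vectors of pen trajectories and may differ in local distance, step pattern and normalization, so this is not the code that produced UNIPEN-DTW.mat.

```python
def dtw_distance(x, y):
    """Cost of the cheapest monotone alignment between sequences x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in x only
                cost[i][j - 1],      # advance in y only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]
```

For instance, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated sample can be absorbed by the warping, which is the property that makes DTW suitable for handwriting written at varying speed.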
USPS-TD
- MATLAB data: USPS-TD.mat
- Description:
Four two-sided tangent distance matrices were computed, one for each of the
USPS training sample ranges 1-250, 251-500, 501-750 and 751-1000, using the
tangent distance implementation provided by D. Keysers (Keysers et al. 2004),
which was also applied in Haasdonk and Keysers 2002.
To obtain binary classification problems,
the digits 0-4 are assigned to class 1 and the digits 5-9 to class 2.
- Note: The USPS set can by now be handled quite easily as a whole
on current hardware, so decomposing it into small
pieces is suboptimal. We needed the decomposition because of the LOO experiments,
as for the other datasets.
- Original data:
The original USPS data can be accessed via
FTP at
MPI Tübingen. The data format provided there cannot easily be
used in MATLAB. A small conversion routine for producing MATLAB *.mat files is
USPS2matlab.m.
- References:
-
B. Haasdonk and D. Keysers.
Tangent Distance Kernels for Support Vector Machines. In Proc.
of the 16th Int. Conf. on Pattern Recognition, vol. 2, pp. 864-868, IEEE,
2002.
-
D. Keysers, W. Macherey, H. Ney, and J. Dahmen. Adaptation in Statistical
Pattern Recognition Using Tangent Vectors. In IEEE Transactions on Pattern
Analysis and Machine Intelligence, Volume 26, Number 2, pages 269-274,
February 2004.
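The decomposition and the binary labeling described above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch only reproduces the index ranges and the digit-to-class assignment, not the tangent distance computation itself.

```python
def binary_label(digit):
    """Map digits 0-4 to class 1 and digits 5-9 to class 2."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be in 0..9")
    return 1 if digit <= 4 else 2


def chunk_ranges(total=1000, size=250):
    """1-based inclusive sample ranges, one per distance matrix."""
    return [
        (start, min(start + size - 1, total))
        for start in range(1, total + 1, size)
    ]


ranges = chunk_ranges()
labels = [binary_label(d) for d in range(10)]
```

With the defaults, `ranges` lists the four sample ranges named in the description, each of which yields one 250x250 distance matrix.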
Music-EMD/PTD
- MATLAB data: music-EMD.mat,
music-PTD.mat
- Description:
Both files contain distances between music incipits, measured by the Earth
Mover's Distance (EMD) and the Proportional Transportation Distance
(PTD), respectively, as used by Typke et al. 2003.
Each file contains two distance matrices, one for each of two binary
classification problems (labels corresponding to
composer); the two sample sets are the same in both files. The first set consists of
22 (Georg Friedrich Händel) + 28 (Joseph Haydn) pieces, the second of
27 (Wolfgang Amadeus Mozart) + 20 (Gottfried Preyer) pieces.
The distances were provided courtesy of R. Typke, based on the Orpheus
system available at
http://give-lab.cs.uu.nl/orpheus.
We scaled them uniformly (relative to the original data given below) so that they
lie in a reasonable range.
- Note:
The composers do not really form clusters in the distance space, so these
distances are perhaps better used without the composer labels.
- Original data:
EMD_orig.tgz,
PTD_orig.tgz
and labels/index correspondences in
composers.txt.
The two binary melody sets are part of a much larger set of 30 melodies
for each of 16 composers.
After removing duplicates and selecting two binary combinations, however, we
used only the two sets given above. The full matrices clearly contain
much more information.
- References:
-
R. Typke, P. Giannopoulos, R.C. Veltkamp, F. Wiering and R. van Oostrum.
Using transportation distances for measuring melodic similarity.
In Proc. ISMIR 2003, pp. 107-114, 2003.
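Uniform scaling of a distance matrix, as mentioned in the description, means multiplying every entry by the same positive constant, which leaves all distance ratios and rankings unchanged. The Python sketch below illustrates this. The scaling factor actually used for music-EMD.mat and music-PTD.mat is not stated on this page, so normalizing by the largest entry is purely an assumption for illustration.

```python
def scale_uniformly(d, factor):
    """Multiply every entry of the distance matrix by the same positive factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [[factor * v for v in row] for row in d]


def scale_to_unit_max(d):
    """Assumed choice of factor: rescale so that the largest distance becomes 1."""
    largest = max(max(row) for row in d)
    if largest == 0:
        return [row[:] for row in d]
    return scale_uniformly(d, 1.0 / largest)


# Hypothetical distances, for illustration only.
d = [
    [0, 2, 4],
    [2, 0, 8],
    [4, 8, 0],
]
scaled = scale_to_unit_max(d)
```

Methods that depend only on the ordering of distances, such as nearest-neighbour classification, give the same result on `d` and `scaled`; methods with a scale-sensitive parameter do not.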
Last modified on Thu Feb 17 19:00 2005 by B. Haasdonk