In response to various requests, this page provides the experimental data used in the paper:

Haasdonk, B., Bahlmann, C. Learning with Distance Substitution Kernels.

Additional information and larger datasets are available in some cases. Please note that most of the data was provided by other parties, as indicated in the corresponding references.

**MATLAB data:** proteins.mat

**Description:** Distance matrix of evolutionary distances between 226 proteins. The file contains one distance matrix and four binary (one-vs-rest) labelings, one for each of the four major classes. The classes are defined by the first two characters of the protein codes in the original dataset: class 1 (HA, 72 samples), class 2 (HB, 72 samples), class 3 (MY, 39 samples) and class 4 (GG, GP, 30 samples). The remaining 13 samples (HZ, HD, HG, HE, HF) are labeled as rest.

**Note:** Some references apparently use a different labeling: the class "M", which presumably corresponds to our class 3, is reported there to consist of 37 samples.

**Original data:** According to Graepel et al. 1999, the original data, including the complete chemical labels (ASCII format), originates from M. Vingron and T. Hofmann. As we lack explicit permission, we do not post these sets on this page, but we can provide them individually by email.

**References:**

- T. Graepel, R. Herbrich, P. Bollmann-Sdorra and K. Obermayer. Classification on pairwise proximity data. In Adv. Neural Info. Proc. Syst., volume 11, pages 438-444, Cambridge, MA, 1999. MIT Press.
- T. Graepel, R. Herbrich, B. Schölkopf, A. Smola, P. Bartlett, K.-R. Müller, K. Obermayer and B. Williamson. Classification on proximity data with LP-machines. In Proc. 9th ICANN, pp. 304-309, London, 1999, IEE.
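
The one-vs-rest labelings described for proteins.mat can be sketched as follows. This is only an illustration of the prefix rule stated in the description: the protein codes in the example are made up, the function names are hypothetical, and the +1/-1 label encoding is an assumption (the encoding used inside the .mat file is not stated on this page).

```python
# Prefix-to-class mapping taken from the proteins.mat description.
PREFIX_TO_CLASS = {"HA": 1, "HB": 2, "MY": 3, "GG": 4, "GP": 4}


def class_of(code):
    """Return class 1-4 for a protein code, or 0 for the 'rest' group."""
    return PREFIX_TO_CLASS.get(code[:2].upper(), 0)


def one_vs_rest_labels(codes, n_classes=4):
    """Build one label vector per class: +1 for class k, -1 for all others.

    The +1/-1 encoding is an assumption made for this sketch.
    """
    classes = [class_of(c) for c in codes]
    return [[1 if c == k else -1 for c in classes]
            for k in range(1, n_classes + 1)]


# Hypothetical protein codes, for illustration only.
codes = ["HA001", "HB001", "MY001", "GG001", "GP001", "HZ001"]
labelings = one_vs_rest_labels(codes)
```

Samples from the rest group (here `HZ001`) are negative in all four labelings, which matches the statement that they belong to none of the four major classes.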

**MATLAB data:** cat-cortex.mat

**Description:** Distance matrix of connection strengths between 65 regions of the cat's cerebral cortex, together with 4 binary labelings, one for each of the four functional classes: class 1 (visual, V, 18 regions), class 2 (auditory, A, 10 regions), class 3 (somatosensory, S, 18 regions) and class 4 (frontolimbic, F, 19 regions). The distances range from 0 to 4 in steps of 0.5, which result from pairwise averaging of the original non-symmetric distances.

**Note:** Some references apparently use a different labeling: the classes V and S are reported there to contain one more and one fewer sample, respectively.

**Original data:** The original data (ASCII format) can be provided on request. We have not yet obtained permission to publish it on the web.

**References:**

- T. Graepel, R. Herbrich, P. Bollmann-Sdorra and K. Obermayer. Classification on pairwise proximity data. In Adv. Neural Info. Proc. Syst., volume 11, pages 438-444, Cambridge, MA, 1999. MIT Press.
- T. Graepel, R. Herbrich, B. Schölkopf, A. Smola, P. Bartlett, K.-R. Müller, K. Obermayer and B. Williamson. Classification on proximity data with LP-machines. In Proc. 9th ICANN, pp. 304-309, London, 1999, IEE.
- J.W. Scannell, C. Blakemore and M.P. Young. Analysis of connectivity in the cat cerebral cortex. Journal of Neuroscience, 15(2):1463-1483, 1995.
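
The pairwise averaging mentioned for cat-cortex.mat can be sketched as below. The toy matrix is invented, and the assumption that the original entries are integers from 0 to 4 is only inferred from the 0.5 step size; the function name is hypothetical.

```python
def symmetrize(d):
    """Average each entry with its transposed counterpart: (d[i][j] + d[j][i]) / 2."""
    n = len(d)
    return [[(d[i][j] + d[j][i]) / 2 for j in range(n)] for i in range(n)]


# Invented non-symmetric toy matrix with integer entries in 0..4 (assumption).
d = [
    [0, 1, 4],
    [2, 0, 3],
    [4, 4, 0],
]
sym = symmetrize(d)
# sym is symmetric and its entries are multiples of 0.5.
```

Averaging two integers can only produce whole or half values, which is consistent with the 0.5 steps noted in the description.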

**MATLAB data:** kimia.mat

**Description:** Symmetric modified Hausdorff distances between binary shape images, as used in Pekalska et al. 2001. They produced two 72x72 matrices, each covering 6 classes of 12 samples. These classes were in turn chosen from a larger dataset of 18 classes from Kimia and coworkers 2001.

**Note:** kimia-1 and kimia-2 apparently correspond to the sets denoted B and A in one reference.

**Original data:** kimia_orig.tgz. The original data was kindly provided by E. Pekalska, who in turn refers to B.B. Kimia. It consists of the image data, the larger set of 18 classes and the non-symmetric distances on which the symmetric one is based.

**References:**

- E. Pekalska, P. Paclik and R. Duin. A Generalized Kernel Approach to Dissimilarity Based Classification. Journal of Machine Learning Research, 2:175-211, 2001.
- T.B. Sebastian, P.N. Klein and B.B. Kimia. Recognition of Shapes by Editing Shock Graphs. In Proc. ICCV 2001, pp. 755-762, 2001.
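
For orientation, a modified Hausdorff distance between two point sets can be sketched as follows. This uses the common textbook form (mean of nearest-neighbour distances per direction) and symmetrizes by taking the maximum of the two directions. Both choices are assumptions: this page does not state which variant or which symmetrization was used for kimia.mat, and the point sets and function names are illustrative only.

```python
import math


def directed_mhd(a, b):
    """Mean, over the points in a, of the distance to the nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def symmetric_mhd(a, b):
    """Symmetrize by taking the larger directed distance (an assumed choice)."""
    return max(directed_mhd(a, b), directed_mhd(b, a))


# Invented 2-D point sets standing in for shape pixel coordinates.
a = [(0, 0), (1, 0)]
b = [(0, 0), (3, 0)]

d_ab = directed_mhd(a, b)   # 0.5
d_ba = directed_mhd(b, a)   # 1.0
d_sym = symmetric_mhd(a, b) # 1.0
```

The example shows why a symmetrization step is needed at all: the two directed values differ, as with the non-symmetric distances included in the original data.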

**MATLAB data:** UNIPEN-DTW.mat

**Description:** For the leave-one-out (LOO) experiments, only a small fraction of the huge UNIPEN project data (Guyon et al. 1994, UNIPEN-site) was used, specifically a part of section 1c of the Train-R01/V07 database, which contains lower-case characters. The dissimilarity measure is the Dynamic Time Warping (DTW) distance as used by Bahlmann et al. 2002. For each of the two matrices, we randomly drew 50 samples from each of the 5 classes 'a' to 'e' out of the complete database of 61K samples.

**Original data:** The original data is not free; please refer to the project site UNIPEN-site.

**References:**

- C. Bahlmann, B. Haasdonk and H. Burkhardt. On-line Handwriting Recognition with Support Vector Machines - A Kernel Approach. In Proc. of the 8th IWFHR 2002, pp. 49-54, 2002.
- I. Guyon, L. Schomaker, R. Plamondon, M. Liberman and S. Janet. UNIPEN project of on-line data exchange and recognizer benchmarks. In Proc. 12th ICPR, pp. 29-33, IEEE, 1994.
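
As a reminder of what a DTW distance computes, here is a minimal textbook sketch on one-dimensional sequences with an absolute-difference local cost. It is not the variant of Bahlmann et al. 2002, whose features, local distance and normalization are not described on this page; the sequences and the function name are illustrative.

```python
def dtw_distance(x, y):
    """Textbook DTW: minimal accumulated local cost over all monotone alignments."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] holds the best accumulated cost aligning x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in x only
                cost[i][j - 1],      # advance in y only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Sequences of different length that differ only by a repeated value.
d_same = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # 0.0
d_shift = dtw_distance([0, 0], [1, 1])          # 2.0
```

Distances of this kind are generally not metrics, since the triangle inequality can fail.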

**MATLAB data:** USPS-TD.mat

**Description:** Four two-sided tangent distance matrices were computed, one for each of the USPS training sample ranges 1-250, 251-500, 501-750 and 751-1000. We used the tangent distance implementation provided by D. Keysers (Keysers et al. 2004), which was also applied in Haasdonk et al. 2002. To obtain binary classification problems, the digits 0-4 are assigned to class 1 and the digits 5-9 to class 2.

**Note:** The USPS set can by now be handled as a whole quite easily on current hardware, so decomposing it into small pieces is suboptimal. We needed the decomposition for the LOO experiments, as for the other datasets.

**Original data:** The original USPS data can be accessed via FTP at MPI Tübingen. The data format obtained from there cannot easily be used in MATLAB. A small conversion routine for obtaining MATLAB *.MAT files is USPS2matlab.m.

**References:**

- B. Haasdonk and D. Keysers. Tangent Distance Kernels for Support Vector Machines. In Proc. of the 16th Int. Conf. on Pattern Recognition, vol. 2, pp. 864-868, IEEE, 2002.
- D. Keysers, W. Macherey, H. Ney, and J. Dahmen. Adaptation in Statistical Pattern Recognition Using Tangent Vectors. In IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 26, Number 2, pages 269-274, February 2004.
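
The block decomposition and the binary labeling described for USPS-TD.mat can be sketched as follows. The index ranges are 1-based and inclusive, as in the description; the function names and the example digits are hypothetical.

```python
def block_ranges(n_samples=1000, block_size=250):
    """Return 1-based inclusive (first, last) index ranges of consecutive blocks."""
    return [(start + 1, start + block_size)
            for start in range(0, n_samples, block_size)]


def binary_label(digit):
    """Digits 0-4 map to class 1, digits 5-9 to class 2."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be in 0..9")
    return 1 if digit <= 4 else 2


ranges = block_ranges()
# [(1, 250), (251, 500), (501, 750), (751, 1000)]

# Invented digit labels, for illustration only.
digits = [0, 4, 5, 9, 7, 2]
labels = [binary_label(d) for d in digits]
# [1, 1, 2, 2, 2, 1]
```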

**MATLAB data:** music-EMD.mat, music-PTD.mat

**Description:** Both files contain distances between music incipits, measured by the Earth Mover's Distance (EMD) and the Proportional Transportation Distance (PTD), respectively, as used by Typke et al. 2003. Each file contains two distance matrices, corresponding to the same two binary classification problems in both files (labels corresponding to composer). The first set consists of 22 (Georg Friedrich Händel) + 28 (Joseph Haydn) pieces, the second of 27 (Wolfgang Amadeus Mozart) + 20 (Gottfried Preyer) pieces. The distances were provided by courtesy of R. Typke, based on the Orpheus system available at http://give-lab.cs.uu.nl/orpheus . We scaled them uniformly (starting from the original data given below) to lie in a reasonable range.

**Note:** The composers do not really form clusters in the distance space, so these distances are perhaps better used without the composer labels.

**Original data:** EMD_orig.tgz, PTD_orig.tgz and the label/index correspondences in composers.txt. The two binary melody sets are part of a much larger set of 30 melodies for each of 16 composers. After removing duplicates and selecting two binary combinations, however, we used only the two sets given above. The full matrices clearly contain much more information.

**References:**

- R. Typke, P. Giannopoulos, R.C. Veltkamp, F. Wiering and R. van Oostrum. Using transportation distances for measuring melodic similarity. In Proc. ISMIR 2003, pp. 107-114, 2003.