Is the Distance Compression Effect Overstated? Some Theory and Experimentation

Previous work in the document clustering literature has shown that the Minkowski-p distance metrics are unsuitable for clustering very high-dimensional document data. This unsuitability is attributed to the "compression" of distances that the Minkowski-p metrics produce on high-dimensional data. Previous experimental work on distance compression has generally used the performance of clustering algorithms, run on the distances produced by different metrics, as a proxy for the quality of the distance representations those metrics create. To separate the effect of the distances from the performance of the clustering algorithms, we test the homogeneity of the latent classes with respect to item neighborhoods, rather than the homogeneity of clustering solutions with respect to latent classes. We show the theoretical relationships among the cosine, correlation, and Euclidean metrics. We posit that part of the performance differential between the cosine and correlation metrics and the Minkowski-p metrics is due to the built-in normalization of the cosine and correlation metrics. The normalization effect decreases with increasing dimensionality, whereas the distance compression effect increases with it. For document datasets with dimensionality up to 20,000, the normalization effect dominates the distance compression effect. We propose a methodology for measuring the relative sizes of the normalization and distance compression effects.
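
The relationship among the cosine, correlation, and Euclidean metrics can be illustrated with a well-known identity: for vectors scaled to unit length, the squared Euclidean distance equals 2(1 - cosine similarity), and the correlation is the cosine similarity of the mean-centered vectors. The sketch below checks this numerically on random data. It is an illustration under these assumptions, not necessarily the derivation used in the paper, and every function name in it is hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def unit(u):
    # Scale a vector to unit Euclidean length (the "built-in normalization").
    n = norm(u)
    return [a / n for a in u]

def center(u):
    # Subtract the vector's own mean from every component.
    m = sum(u) / len(u)
    return [a - m for a in u]

def cosine_similarity(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def correlation(u, v):
    # Pearson correlation is the cosine similarity of the centered vectors.
    return cosine_similarity(center(u), center(v))

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical stand-ins for two document vectors (assumed, not real data).
rng = random.Random(0)
x = [rng.random() for _ in range(1000)]
y = [rng.random() for _ in range(1000)]

# Squared Euclidean distance on unit vectors versus 2 * (1 - cosine).
lhs_cos = squared_euclidean(unit(x), unit(y))
rhs_cos = 2.0 * (1.0 - cosine_similarity(x, y))

# The same identity after centering, which gives the correlation form.
lhs_corr = squared_euclidean(unit(center(x)), unit(center(y)))
rhs_corr = 2.0 * (1.0 - correlation(x, y))

# Cosine similarity ignores vector length; raw Euclidean distance does not.
x_scaled = [3.0 * a for a in x]
cos_scaled = cosine_similarity(x_scaled, y)

print(abs(lhs_cos - rhs_cos) < 1e-9, abs(lhs_corr - rhs_corr) < 1e-9)
```

Because the two sides agree up to floating-point error, ranking neighbors by cosine or correlation is equivalent to ranking them by Euclidean distance after normalization, which is why normalization can be treated as an effect separate from distance compression.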
