Distance metrics for high dimensional nearest neighborhood recovery: Compression and normalization

Previous work has shown that the Minkowski-p distance metrics are unsuitable for clustering very high dimensional document data. We extend this work. We frame statistical theory on the relationships between the Euclidean, cosine, and correlation distance metrics in terms of item neighborhoods. We discuss the differences between the cosine and correlation distance metrics and illustrate our discussion with an example from collaborative filtering. We introduce a family of normalized Minkowski metrics and test their use on both document data and synthetic data generated from the uniform distribution. We describe a range of criteria for testing neighborhood homogeneity relative to underlying latent classes, and we discuss how these criteria are explicitly and implicitly linked to classification performance. By testing both normalized and non-normalized Minkowski-p metrics for multiple values of p, we separate distance compression effects from normalization effects. For multi-class classification problems, we believe that distance compression on high dimensional data aids classification and data analysis. For document data, we find that the cosine (and normalized Euclidean), correlation, and proportioned city block metrics give strong neighborhood recovery. The proportioned city block metric gives particularly good results for nearest neighbor recovery and should be used with document analysis techniques for which nearest neighbor recovery is important. For data generated from the uniform distribution, neighborhood recovery improves as the value of p increases.
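
The abstract's pairing of the cosine and normalized Euclidean metrics reflects a standard identity: for vectors scaled to unit L2 norm, squared Euclidean distance equals 2(1 - cosine similarity), so the two metrics induce the same neighbor rankings; correlation distance is, likewise, the cosine distance of mean-centered vectors. The sketch below illustrates this in Python. It is a minimal illustration, not the paper's implementation: the function names are ours, and the definition of `normalized_minkowski` (each vector rescaled to unit L_p norm before measuring distance) is one plausible reading of the paper's normalized Minkowski family.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski-p distance between vectors x and y."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def normalized_minkowski(x, y, p):
    """Minkowski-p distance after rescaling each vector to unit L_p norm.
    An assumed reading of the paper's normalized Minkowski family; for p = 1
    on nonnegative term counts this rescales vectors to proportions, which
    plausibly matches the 'proportioned city block' metric."""
    xn = x / np.sum(np.abs(x) ** p) ** (1.0 / p)
    yn = y / np.sum(np.abs(y) ** p) ** (1.0 / p)
    return minkowski(xn, yn, p)

def cosine_distance(x, y):
    """1 - cosine similarity."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def correlation_distance(x, y):
    """Cosine distance of the mean-centered vectors (1 - Pearson r)."""
    return cosine_distance(x - x.mean(), y - y.mean())

rng = np.random.default_rng(0)
x, y = rng.random(1000), rng.random(1000)

# For unit-length vectors, ||x - y||^2 = 2 * (1 - cos(x, y)), so the
# normalized Euclidean (p = 2) and cosine metrics rank neighbors identically.
print(normalized_minkowski(x, y, 2) ** 2)  # numerically equal to the next line
print(2 * cosine_distance(x, y))
```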
