Dimensionality, Discriminability, Density and Distance Distributions

For many large-scale applications in data mining, machine learning, and multimedia, fundamental operations such as similarity search, retrieval, classification, clustering, and anomaly detection generally suffer from an effect known as the "curse of dimensionality". As the dimensionality of the data increases, distance values tend to become less discriminative, due to their increasing relative concentration about the mean of their distribution. For this reason, researchers have considered the analysis of structures and methods in terms of measures of the intrinsic dimensionality of the data sets. This paper is concerned with a generalization of a discrete measure of intrinsic dimensionality, the expansion dimension, to the case of continuous distance distributions. This notion of the intrinsic dimensionality of a distribution is shown to precisely coincide with a natural notion of the indiscriminability of distances and features. Furthermore, for any distance distribution with differentiable cumulative density function, a fundamental relationship is shown to exist between probability density, the cumulative density (cumulative probability divided by distance), intrinsic dimensionality, and discriminability.
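To make these notions concrete, the sketch below illustrates one natural reading of the continuous model described above: taking the intrinsic dimensionality of a distance distribution with cumulative distribution function F at distance x to be ID_F(x) = x * F'(x) / F(x), the growth rate of cumulative probability with distance. This formula, the function names, and the example distribution are illustrative assumptions for this sketch, not text quoted from the paper. For distances whose cumulative distribution is F(x) = (x/w)^d, as arises for points distributed uniformly in a d-dimensional ball of radius w, the model recovers d at every radius, and a discrete two-radius estimate in the spirit of the expansion dimension approximates the same value from sample distances.

```python
# Minimal sketch of a continuous intrinsic-dimensionality model,
# assuming the form ID_F(x) = x * F'(x) / F(x) for a distance cdf F.
# All names here are illustrative, not from the paper or any library.

import numpy as np

def analytic_id(x, F, dFdx):
    """Model intrinsic dimensionality of a distance cdf F at distance x."""
    return x * dFdx(x) / F(x)

# Example: distances to points uniform in a d-dimensional ball of radius w
# have cdf F(x) = (x / w) ** d on [0, w], so ID_F(x) = d at every radius.
d, w = 5, 1.0
F = lambda x: (x / w) ** d
dFdx = lambda x: d * x ** (d - 1) / w ** d
print(analytic_id(0.3, F, dFdx))  # -> 5.0

def two_radius_id(dists, r1, r2):
    """Discrete estimate in the spirit of the expansion dimension:
    the log-ratio growth of neighborhood size between radii r1 < r2."""
    n1 = np.sum(dists <= r1)
    n2 = np.sum(dists <= r2)
    return np.log(n2 / n1) / np.log(r2 / r1)

# Usage: sample distances with cdf F(x) = x ** d via inverse-cdf sampling;
# the estimate should be close to d for radii well inside the support.
rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 1.0, size=100_000) ** (1.0 / d)
print(two_radius_id(sample, 0.2, 0.4))  # approximately 5
```

In this sketch, the agreement between the analytic value and the two-radius estimate reflects the generalization described in the abstract: as the two radii approach a common value, the discrete expansion rate converges to the continuous quantity x * F'(x) / F(x), tying the dimensionality of the distribution directly to how quickly probability mass (and hence discriminability) changes with distance.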
