Local Intrinsic Dimensionality II: Multivariate Analysis and Distributional Support

Distance-based expansion models of intrinsic dimensionality have recently been applied to the analysis of the complexity of similarity applications and to the design of efficient heuristics. This theory paper extends one such model, the local intrinsic dimensionality (LID), to a multivariate form that can account for the contributions of different distributional components towards the intrinsic dimensionality of the entire feature set, or equivalently towards the discriminability of distance measures defined in terms of these feature combinations. Formulas are established for the effect on LID of summation, product, composition, and convolution operations on smooth functions in general, and on cumulative distribution functions in particular. For some of these operations, the dimensional or discriminability characteristics of the result are also shown to depend on a form of distributional support. As an example, an analysis is provided that quantifies the impact of added random Gaussian noise on the intrinsic dimensionality of data. Finally, a theoretical relationship is established between the LID model and the classical correlation dimension.
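
To make the summation, product, and composition operations concrete, the following is a minimal numerical sketch rather than the paper's own derivation. It assumes the definition commonly used in the LID literature, ID_F(r) = r F'(r) / F(r), which the abstract itself does not state. The functions F and G, the helper lid, and the evaluation point r = 0.5 are hypothetical choices made only for illustration. The three identities checked here follow from that definition by elementary calculus; the convolution case and the role of distributional support are not covered.

```python
import math

def lid(F, r, h=1e-6):
    """Estimate ID_F(r) = r * F'(r) / F(r) using a central difference for F'."""
    dF = (F(r + h) - F(r - h)) / (2.0 * h)
    return r * dF / F(r)

# Hypothetical smooth functions, chosen because their LID is known in closed form.
def F(r):
    return r ** 3                 # ID_F(r) = 3

def G(r):
    return r ** 2 * math.exp(r)   # ID_G(r) = 2 + r

r = 0.5
id_f, id_g = lid(F, r), lid(G, r)

# Product F * G: the LID values add.
id_product = lid(lambda x: F(x) * G(x), r)

# Sum F + G: the LID values are averaged, weighted by the function values.
id_sum = lid(lambda x: F(x) + G(x), r)
id_sum_expected = (F(r) * id_f + G(r) * id_g) / (F(r) + G(r))

# Composition G(F(r)): ID_G evaluated at F(r), multiplied by ID_F at r.
id_comp = lid(lambda x: G(F(x)), r)
id_comp_expected = lid(G, F(r)) * id_f

print("product:    ", id_product, "vs", id_f + id_g)
print("sum:        ", id_sum, "vs", id_sum_expected)
print("composition:", id_comp, "vs", id_comp_expected)
```

In each printed pair the two values should agree up to finite-difference error. The step size h is an assumption suited to these particular functions and may need adjusting for others.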
