The Curse Revisited: a Newly Quantified Concept of Meaningful Distances for Learning from High-Dimensional Noisy Data

Distances between data points are widely used in point cloud representation learning. Yet, it is no secret that under the effect of noise, these distances—and thus the models based upon them—may lose their usefulness in high dimensions. Indeed, the small marginal effects of the noise may accumulate quickly, shifting empirical closest and furthest neighbors away from the ground truth. In this paper, we characterize such effects in high-dimensional data through an asymptotic probabilistic expression. Furthermore, while it has previously been argued that neighborhood queries become meaningless and unstable when the relative discrimination between the furthest and closest point is poor, we conclude that this is not necessarily the case once the ground truth data is explicitly separated from the noise. More specifically, we derive that under particular conditions, empirical neighborhood relations affected by noise are still likely to be true even when this discrimination is observed to be poor. We include thorough empirical verification of our results, as well as experiments that, interestingly, show that the phase transition we derive, at which neighbors do or do not become random, coincides with the phase transition at which common dimensionality reduction methods succeed or fail at finding low-dimensional representations of high-dimensional data with dense noise.
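The "poor relative discrimination between the furthest and closest point" referred to above is the classical distance-concentration effect. A minimal sketch of that effect, assuming i.i.d. standard Gaussian data as a stand-in for the paper's noise model (the exact data and noise distributions in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points: int, dim: int) -> float:
    """Relative discrimination (d_max - d_min) / d_min between the
    furthest and closest neighbor of a random query point, for
    i.i.d. standard Gaussian data in `dim` dimensions."""
    data = rng.standard_normal((n_points, dim))
    query = rng.standard_normal(dim)
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# As dimensionality grows, the contrast shrinks toward zero:
# nearest and furthest neighbors become nearly indistinguishable.
for dim in (2, 20, 200, 2000):
    print(f"dim={dim:5d}  contrast={relative_contrast(1000, dim):.3f}")
```

The paper's point is that this shrinking contrast alone does not imply that empirical nearest neighbors are wrong with respect to the (noise-free) ground truth.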