The Hubness Phenomenon in High-Dimensional Spaces

High-dimensional data analysis is often hampered by the curse of dimensionality: in high-dimensional spaces, data become extremely sparse and distances between points become nearly indistinguishable. As a consequence, neither reliable density estimates nor meaningful distance-based similarity measures can be obtained. This issue is particularly acute in clustering, a staple of exploratory data analysis. A further challenge in clustering high-dimensional data is that clusters often exist in subspaces formed by combinations of dimensions, with different subspaces relevant to different clusters. The hubness phenomenon is a recently discovered aspect of high-dimensional spaces: in intrinsically high-dimensional data, the distribution of nearest-neighbor occurrences becomes skewed, with a few points, the hubs, attaining very high occurrence counts. Hubness becomes more pronounced as dimensionality increases. Hubs are also known to exhibit useful clustering properties and can be leveraged to mitigate the challenges of high-dimensional data analysis. In this chapter, we identify new geometric relationships between hubness, data density, and the distance distribution of the data, as well as between hubness, subspaces, and the intrinsic dimensionality of the data. In addition, we outline several research directions for leveraging hubness in clustering and in subspace estimation.
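The skew described above can be observed directly by counting, for each point, how often it appears among the k nearest neighbors of the other points (its k-occurrence count, often denoted N_k) and measuring the skewness of that distribution as dimensionality grows. The sketch below is a minimal illustration, not the chapter's method; the sample sizes, k, and the Gaussian data are illustrative assumptions, and it uses only NumPy with brute-force Euclidean distances.

```python
import numpy as np

def k_occurrence_counts(X, k=5):
    """Count how often each point appears among the k nearest neighbors of the others."""
    # Pairwise squared Euclidean distances via the expansion ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # Indices of the k nearest neighbors of each point
    nn = np.argsort(d2, axis=1)[:, :k]
    # N_k: how many k-NN lists each point appears in
    return np.bincount(nn.ravel(), minlength=len(X))

def skewness(x):
    """Standardized third moment of a sample."""
    x = np.asarray(x, dtype=float)
    return np.mean(((x - x.mean()) / x.std()) ** 3)

rng = np.random.default_rng(0)
for d in (3, 20, 100):
    X = rng.standard_normal((1000, d))  # i.i.d. Gaussian data, intrinsic dimension d
    print(f"d={d:4d}  skewness of N_5 = {skewness(k_occurrence_counts(X, k=5)):.2f}")
```

On such data the skewness of the N_k distribution typically increases markedly with d, reflecting the emergence of hubs.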
