The Role of Hubness in Clustering High-Dimensional Data

High-dimensional data arise naturally in many domains and have long posed a challenge for traditional data mining techniques, in terms of both effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data and the growing difficulty of distinguishing distances between data points. In this paper, we take a novel perspective on the problem of clustering high-dimensional data. Instead of attempting to avoid the curse of dimensionality by restricting attention to a lower-dimensional feature subspace, we embrace dimensionality by taking advantage of inherently high-dimensional phenomena. More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs) that frequently occur in the k-nearest-neighbor lists of other points, can be successfully exploited in clustering. We validate our hypothesis by demonstrating that hubness is a good measure of point centrality within a high-dimensional data cluster, and by proposing several hubness-based clustering algorithms, showing that major hubs can be used effectively as cluster prototypes or as guides during the search for centroid-based cluster configurations. Experimental results demonstrate the good performance of our algorithms in multiple settings, particularly in the presence of large quantities of noise. The proposed methods are tailored mostly to detecting approximately hyperspherical clusters and would need to be extended to properly handle clusters of arbitrary shape.
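The hubness score described above, often written N_k(x), is simply the number of times a point x appears among the k nearest neighbors of the other points in the dataset. A minimal sketch of computing it, using synthetic data and plain NumPy (the variable names and data here are illustrative, not taken from the paper):

```python
import numpy as np

# Illustrative high-dimensional synthetic data (200 points, 50 dimensions)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
k = 5

# Pairwise Euclidean distances; exclude each point from its own neighborhood
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)

# k nearest neighbors of every point
knn = np.argsort(d, axis=1)[:, :k]

# N_k(x): how often each point occurs across all k-NN lists
N_k = np.bincount(knn.ravel(), minlength=len(X))

# Points with the highest N_k are the "major hubs" that the proposed
# algorithms use as cluster prototypes or as guides for centroid search
hubs = np.argsort(N_k)[::-1][:10]
```

Since each of the n points contributes exactly k entries to the neighbor lists, the scores always sum to n*k; in high dimensions their distribution becomes markedly skewed, which is the phenomenon the paper exploits.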
