The unbalancing effect of hubs on K-medoids clustering in high-dimensional spaces

Unbalanced cluster solutions exhibit very different cluster sizes, with some clusters being very large while others contain almost no data. We demonstrate that this phenomenon is connected to 'hubness', a recently discovered general problem of machine learning in high-dimensional data spaces. Hub objects have a small distance to an exceptionally large number of data points, whereas anti-hubs are far from all other data points. In an empirical study of K-medoids clustering, we show that hubness gives rise to very unbalanced cluster sizes, resulting in impaired internal and external evaluation indices. We compare three methods that reduce hubness in the distance spaces and show that the evaluation indices improve as the clusters become more balanced. This is done using artificial and real data sets from diverse domains.
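
To make the notion of hubs and anti-hubs concrete, the following is a minimal sketch of how hubness could be quantified. It assumes the measure common in the hubness literature, the skewness of the k-occurrence distribution (how often each point appears among the k nearest neighbors of other points); the abstract itself does not state which measure the study uses. The function names `k_occurrence` and `skewness`, the choice k=5, Euclidean distance, and the Gaussian toy data are all illustrative assumptions.

```python
import numpy as np

def k_occurrence(X, k=5):
    """Count how often each point is among the k nearest neighbors of other points.

    Assumption: Euclidean distance; k=5 is an arbitrary illustrative choice.
    """
    sq = np.sum(X ** 2, axis=1)
    # Squared pairwise distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2ab
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors
    return np.bincount(nn.ravel(), minlength=X.shape[0])

def skewness(counts):
    """Third standardized moment; larger positive values suggest stronger hubness."""
    c = np.asarray(counts, dtype=float)
    std = c.std()
    if std == 0:
        return 0.0
    return float(np.mean((c - c.mean()) ** 3) / std ** 3)

# Toy data (hypothetical): i.i.d. Gaussian points in low and high dimension
rng = np.random.default_rng(0)
counts_low = k_occurrence(rng.standard_normal((500, 3)))
counts_high = k_occurrence(rng.standard_normal((500, 100)))

print("skewness, 3 dimensions:  ", skewness(counts_low))
print("skewness, 100 dimensions:", skewness(counts_high))
print("anti-hubs (never a neighbor), 100 dimensions:", int(np.sum(counts_high == 0)))
```

Under this measure, hubs are the points with very large counts and anti-hubs are the points with a count of zero. A strongly right-skewed count distribution is the kind of situation in which a medoid placed on a hub could attract a disproportionate share of the data.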
