A comprehensive empirical comparison of hubness reduction in high-dimensional spaces

Hubness is an aspect of the curse of dimensionality related to the distance concentration effect. Hubs occur in high-dimensional data spaces as objects that appear particularly often among the nearest neighbors of other objects. Conversely, other data objects become antihubs, which are rarely or never nearest neighbors of other objects. Many machine learning algorithms rely on nearest neighbor search and some form of distance measurement, both of which are impaired by high hubness. Degraded performance due to hubness has been reported for various tasks such as classification, clustering, regression, visualization, recommendation, retrieval, and outlier detection. Several hubness reduction methods based on different paradigms have previously been developed. Local and global scaling as well as shared-neighbor approaches aim at repairing asymmetric neighborhood relations. Global and localized centering try to eliminate spatial centrality, while the related global and local dissimilarity measures are based on density gradient flattening. Additional methods and alternative dissimilarity measures that were argued to mitigate detrimental effects of distance concentration also influence the related hubness phenomenon. In this paper, we present a large-scale empirical evaluation of all available unsupervised hubness reduction methods and dissimilarity measures. We investigate several aspects of hubness reduction as well as its influence on data semantics, which we measure via nearest neighbor classification. Scaling and density gradient flattening methods consistently improve evaluation measures such as hubness and classification accuracy for data sets from a wide range of domains, while centering approaches achieve the same only under specific settings.
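
To make the quantities above concrete, the short Python sketch below estimates hubness as the skewness of the k-occurrence distribution (how often each object appears among the k nearest neighbors of the others) and reduces it with an empirical Mutual Proximity rescaling, a global scaling method from this line of work. This is a minimal illustration, not the evaluation pipeline used in the paper; the function names, the toy Gaussian data, and the choice of k = 10 are assumptions made for the example.

# Minimal illustrative sketch (not the paper's evaluation pipeline).
# Hubness is estimated as the skewness of the k-occurrence distribution;
# empirical Mutual Proximity is one global scaling method that typically
# reduces it. The toy data and k = 10 are arbitrary choices for this example.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import skew

def k_occurrence(D, k=10):
    """Count how often each object is among the k nearest neighbors of the others."""
    n = D.shape[0]
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        order = np.argsort(D[i])
        neighbors = [j for j in order if j != i][:k]  # exclude the query object itself
        counts[np.asarray(neighbors)] += 1
    return counts

def hubness(D, k=10):
    """Skewness of the k-occurrence distribution; larger values indicate stronger hubness."""
    return skew(k_occurrence(D, k))

def mutual_proximity(D):
    """Empirical Mutual Proximity: the fraction of objects farther from both i and j
    than d(i, j), turned into a distance by taking the complement."""
    n = D.shape[0]
    D_mp = np.zeros_like(D, dtype=float)
    for i in range(n):
        for j in range(i + 1, n):
            p = np.mean((D[i] > D[i, j]) & (D[j] > D[i, j]))
            D_mp[i, j] = D_mp[j, i] = 1.0 - p
    return D_mp

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 100))           # high-dimensional toy data
    D = squareform(pdist(X, metric="euclidean"))  # primary distance matrix
    print("hubness before Mutual Proximity:", hubness(D))
    print("hubness after  Mutual Proximity:", hubness(mutual_proximity(D)))

On data like this, the rescaled distances usually show a markedly lower k-occurrence skewness than the primary Euclidean distances, which is the kind of effect the scaling methods in the study are designed to achieve.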
