Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification

Most data of interest in today's data-mining applications is complex and is typically represented by many different features. Such high-dimensional data is, by its very nature, often difficult for conventional machine-learning algorithms to handle; this is considered an aspect of the well-known curse of dimensionality. Consequently, high-dimensional data needs to be processed with care, and machine-learning algorithms should be designed with these factors in mind. At the same time, it has been observed that some of the properties that arise in high dimensions can in fact be exploited to improve algorithm design. One such phenomenon, related to nearest-neighbor learning methods, is known as hubness and refers to the emergence of very influential nodes (hubs) in k-nearest neighbor graphs. A crisp weighted voting scheme for the k-nearest neighbor classifier that exploits this notion has recently been proposed. In this paper we go a step further by embracing the soft approach and propose several fuzzy measures for k-nearest neighbor classification, all based on hubness, which express the fuzziness of elements appearing in the k-neighborhoods of other points. Experimental evaluation on real data from the UCI repository and the image domain suggests that the fuzzy approach provides a useful measure of confidence in the predicted labels, resulting in improvements over both the crisp weighted method and the standard kNN classifier.
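
To make the hubness-based fuzzy voting idea concrete, the following Python sketch illustrates one way such a measure can be built; it is a minimal illustration under simplifying assumptions, not the exact formulation proposed in the paper. It counts, for each training point, how often that point appears among the k nearest neighbors of points of each class (its class hubness), smooths these occurrence profiles, and lets neighbors vote with the resulting fuzzy class memberships instead of crisp labels. The function names, the Laplace-style smoothing parameter lam, and the brute-force neighbor search are illustrative assumptions.

    import numpy as np

    def class_hubness_scores(X, y, k, n_classes):
        """Count how often each training point appears in the k-neighborhoods
        of points of each class (its class hubness)."""
        n = X.shape[0]
        # Brute-force pairwise Euclidean distances (illustrative; fine for small n).
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
        counts = np.zeros((n, n_classes))
        for i in range(n):
            knn = np.argsort(d[i])[:k]         # k nearest neighbors of point i
            counts[knn, y[i]] += 1             # point i contributes to its neighbors' class hubness
        return counts

    def hubness_fuzzy_knn_predict(X_train, y_train, X_test, k, n_classes, lam=1.0):
        """Fuzzy kNN in which each neighbor votes with its smoothed class-hubness
        profile rather than with a crisp label (a sketch, not the paper's method)."""
        counts = class_hubness_scores(X_train, y_train, k, n_classes)
        # Laplace-style smoothing so rarely occurring points (anti-hubs) still
        # produce usable fuzzy votes; lam is an assumed, illustrative value.
        fuzzy = (counts + lam) / (counts.sum(axis=1, keepdims=True) + lam * n_classes)
        preds = np.empty(len(X_test), dtype=int)
        for j, x in enumerate(X_test):
            d = np.linalg.norm(X_train - x, axis=1)
            knn = np.argsort(d)[:k]
            preds[j] = np.argmax(fuzzy[knn].sum(axis=0))   # sum of fuzzy votes per class
        return preds

The smoothing step is included because rarely occurring points (anti-hubs) have nearly empty occurrence profiles and would otherwise contribute uninformative votes.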
