Redefining nearest neighbor classification in high-dimensional settings

Abstract: In this work, a novel nearest neighbor approach is presented. The main idea is to redefine the distance metric so that it includes only a subset of relevant variables, assuming these variables are equally important to the classification model. Three distance measures are redefined in this way: the traditional squared Euclidean, the Manhattan, and the Chebyshev. These modifications are designed to improve classification performance in high-dimensional applications, where the concept of distance becomes blurred, i.e., all training points become nearly equidistant from one another. In addition, including noisy variables degrades predictive performance when the main patterns are confined to just a few variables, because all variables are weighted equally in the standard metrics. Experimental results on low- and high-dimensional datasets demonstrate the importance of these modifications, which yield superior average performance in terms of Area Under the Curve (AUC) compared with the traditional k-nearest-neighbor approach.
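The central idea, restricting each distance to a fixed subset of equally weighted relevant variables before running kNN, can be illustrated with a minimal sketch. The helper names (masked_distances, knn_predict) and the assumption that the relevant subset is supplied externally are illustrative, not taken from the paper:

```python
import numpy as np

def masked_distances(X_train, x, subset, metric="sqeuclidean"):
    """Distances from query x to every training row, computed only
    over the columns in `subset` (the relevant variables), with all
    selected variables weighted equally."""
    diff = X_train[:, subset] - x[subset]
    if metric == "sqeuclidean":    # squared Euclidean
        return np.sum(diff ** 2, axis=1)
    if metric == "manhattan":      # L1
        return np.sum(np.abs(diff), axis=1)
    if metric == "chebyshev":      # L-infinity
        return np.max(np.abs(diff), axis=1)
    raise ValueError(f"unknown metric: {metric}")

def knn_predict(X_train, y_train, x, subset, k=5, metric="sqeuclidean"):
    """Plain majority-vote kNN over the subset-restricted distance."""
    d = masked_distances(X_train, x, subset, metric)
    nearest = np.argsort(d)[:k]             # indices of the k closest points
    votes = np.bincount(y_train[nearest])   # class counts among neighbors
    return np.argmax(votes)
```

Calling knn_predict with subset = np.arange(X_train.shape[1]) recovers the standard kNN baseline; passing a smaller index set applies the restricted metrics the abstract describes.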
