Predicting the number of nearest neighbors for the k-NN classification algorithm

k-Nearest Neighbor (k-NN) is one of the most widely used classification algorithms. To classify a new instance, k-NN first finds the instance's k nearest neighbors and then assigns it to the majority class among those neighbors. An appropriate number of neighbors is therefore critical to the k-NN classifier's performance; however, at present there is no systematic method for determining the value of k. To address this problem, we propose a novel method that uses back-propagation neural networks to learn the relationship between data set characteristics and the optimal value of k; this learned relationship, together with the characteristics of a new data set, is then used to recommend a value of k for that data set. Experimental results on 49 UCI benchmark data sets show that, compared with the optimal k values, the recommended k values reduce the average classification accuracy of the k-NN classifier by only 1.61%, while greatly shortening the time needed to determine k.
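
The two-phase pipeline the abstract describes can be sketched in code. Below is a minimal, illustrative sketch using scikit-learn; the particular meta-features (sample count, feature count, class count, class entropy), the search range for k, and the use of MLPRegressor as the back-propagation network are assumptions made for illustration, not the paper's exact configuration.

```python
# Minimal sketch of the meta-learning approach: learn a mapping from data set
# characteristics to the optimal k, then recommend k for a new data set.
import numpy as np
from scipy.stats import entropy
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

def meta_features(X, y):
    """Compute a few simple data set characteristics (hypothetical choice)."""
    _, counts = np.unique(y, return_counts=True)
    return np.array([X.shape[0], X.shape[1], len(counts),
                     entropy(counts / counts.sum())])

def best_k(X, y, k_max=30):
    """Find the empirically optimal k by cross-validation (training phase only)."""
    scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in range(1, k_max + 1)]
    return int(np.argmax(scores)) + 1

def fit_k_recommender(training_datasets):
    """Train a back-propagation network on (meta-features, optimal k) pairs.

    `training_datasets` is a hypothetical list of (X, y) pairs, e.g. the
    UCI benchmark data sets used in the paper.
    """
    M = np.array([meta_features(X, y) for X, y in training_datasets])
    ks = np.array([best_k(X, y) for X, y in training_datasets])
    model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
    model.fit(M, ks)  # trained by back-propagation
    return model

def recommend_k(model, X, y):
    """Recommend k for a new data set without searching over candidate values."""
    k = int(round(model.predict(meta_features(X, y).reshape(1, -1))[0]))
    return max(1, k)
```

The key cost saving is that the expensive cross-validated search over k (`best_k`) is paid once per training data set, while a new data set only requires computing its meta-features and one forward pass through the network.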
