RHC: a non-parametric cluster-based data reduction for efficient k-NN classification

Abstract: Although the k-NN classifier is a popular classification method, it suffers from the high computational cost and storage requirements it involves. This paper proposes two effective cluster-based data reduction algorithms for efficient k-NN classification. Both have low preprocessing cost and can achieve high data reduction rates while maintaining k-NN classification accuracy at high levels. The first proposed algorithm is called reduction through homogeneous clusters (RHC) and is based on a fast preprocessing clustering procedure that creates homogeneous clusters. The centroids of these clusters constitute the reduced training set. The second proposed algorithm is a dynamic version of RHC that retains all its properties and, in addition, can manage datasets that cannot fit in main memory and is appropriate for dynamic environments where new training data become gradually available. Experimental results, based on fourteen datasets, illustrate that both algorithms are faster and achieve higher reduction rates than four known methods, while maintaining high classification accuracy.
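The RHC idea the abstract describes, repeatedly splitting non-homogeneous clusters until every cluster contains a single class and keeping the homogeneous cluster centroids as prototypes, can be sketched in plain Python. This is a minimal illustration of the idea, not the authors' implementation; the class-mean seeding, the fixed k-means iteration count, and the degenerate-split fallback are assumptions made here for the sake of a self-contained example:

```python
# Sketch of reduction through homogeneous clusters (RHC), assuming:
# a cluster is "homogeneous" when all its points share one label, and
# non-homogeneous clusters are split by k-means seeded with per-class means.

def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def dist2(a, b):
    """Squared Euclidean distance (enough for nearest-center comparisons)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(items, seeds, iters=20):
    """Plain k-means over (vector, label) items; returns non-empty clusters."""
    centers = [list(s) for s in seeds]
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for item in items:
            j = min(range(len(centers)), key=lambda c: dist2(item[0], centers[c]))
            clusters[j].append(item)
        # Recompute centers; keep the old center if a cluster emptied out.
        centers = [mean([v for v, _ in cl]) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return [cl for cl in clusters if cl]

def rhc(training_set):
    """training_set: list of (vector, label) pairs.
    Returns the reduced set: one (centroid, label) prototype per
    homogeneous cluster found by the recursive splitting procedure."""
    queue = [training_set]
    prototypes = []
    while queue:
        cluster = queue.pop()
        labels = {lab for _, lab in cluster}
        if len(labels) == 1:
            # Homogeneous cluster: its centroid joins the reduced set.
            prototypes.append((mean([v for v, _ in cluster]), labels.pop()))
            continue
        # Non-homogeneous: split with k-means, one seed per class (class mean).
        seeds = [mean([v for v, lab in cluster if lab == L]) for L in labels]
        parts = kmeans(cluster, seeds)
        if len(parts) == 1:
            # Degenerate split (e.g. coinciding class means): fall back to
            # the majority label so the procedure always terminates.
            lab = max(labels, key=lambda L: sum(1 for _, l in cluster if l == L))
            prototypes.append((mean([v for v, _ in cluster]), lab))
        else:
            queue.extend(parts)
    return prototypes
```

On two well-separated classes, a single split already yields homogeneous clusters, so the six training points below reduce to two prototypes; k-NN would then classify against those centroids instead of the full set.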
