FCNN-MR: A Parallel Instance Selection Method Based on Fast Condensed Nearest Neighbor Rule

Instance selection (IS) technique is used to reduce the data size to improve the performance of data mining methods. Recently, to process very large data set, several proposed methods divide the training set into some disjoint subsets and apply IS algorithms independently to each subset. In this paper, we analyze the limitation of these methods and give our viewpoint about how to divide and conquer in IS procedure. Then, based on fast condensed nearest neighbor (FCNN) rule, we propose a large data sets instance selection method with MapReduce framework. Besides ensuring the prediction accuracy and reduction rate, it has two desirable properties: First, it reduces the work load in the aggregation node; Second and most important, it produces the same result with the sequential version, which other parallel methods cannot achieve. We evaluate the performance of FCNN-MR on one small data set and two large data sets. The experimental results show that it is effective and practical. Keywords—Instance selection, data reduction, MapReduce, kNN.

[1]  Francisco Herrera,et al.  Stratification for scaling up evolutionary prototype selection , 2005, Pattern Recognit. Lett..

[2]  Francisco Herrera,et al.  Data Preprocessing in Data Mining , 2014, Intelligent Systems Reference Library.

[3]  Francisco Herrera,et al.  A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[4]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[5]  C. G. Hilborn,et al.  The Condensed Nearest Neighbor Rule , 1967 .

[6]  M. Narasimha Murty,et al.  An incremental prototype set building technique , 2002, Pattern Recognit..

[7]  G. Gates The Reduced Nearest Neighbor Rule , 1998 .

[8]  William Eberle,et al.  Learning to detect representative data for large scale instance selection , 2015, J. Syst. Softw..

[9]  Nicolás García-Pedrajas,et al.  A divide-and-conquer recursive approach for scaling up instance selection algorithms , 2009, Data Mining and Knowledge Discovery.

[10]  Jian Pei,et al.  Data Mining: Concepts and Techniques, 3rd edition , 2006 .

[11]  Xizhao Wang,et al.  Voting-based instance selection from large data sets with MapReduce and random weight networks , 2016, Inf. Sci..

[12]  Hadi Sadoghi Yazdi,et al.  IRAHC: Instance Reduction Algorithm using Hyperrectangle Clustering , 2015, Pattern Recognit..

[13]  Francisco Herrera,et al.  MRPR: A MapReduce solution for prototype reduction in big data classification , 2015, Neurocomputing.

[14]  José Francisco Martínez Trinidad,et al.  A new fast prototype selection method based on clustering , 2010, Pattern Analysis and Applications.

[15]  Fabrizio Angiulli,et al.  Fast condensed nearest neighbor rule , 2005, ICML.

[16]  Fabrizio Angiulli,et al.  Fast Nearest Neighbor Condensation for Large Data Sets Classification , 2007, IEEE Transactions on Knowledge and Data Engineering.

[17]  Mineichi Kudo,et al.  Comparison of algorithms that select features for pattern classifiers , 2000, Pattern Recognit..

[18]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[19]  David B. Skalak,et al.  Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms , 1994, ICML.

[20]  Chien-Hsing Chou,et al.  The Generalized Condensed Nearest Neighbor Rule as A Data Reduction Method , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[21]  Francisco Herrera,et al.  Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study , 2003, IEEE Trans. Evol. Comput..

[22]  Huan Liu,et al.  On Issues of Instance Selection , 2002, Data Mining and Knowledge Discovery.

[23]  Hiroshi Motoda,et al.  Feature Extraction, Construction and Selection: A Data Mining Perspective , 1998 .

[24]  Tony R. Martinez,et al.  Reduction Techniques for Instance-Based Learning Algorithms , 2000, Machine Learning.

[25]  Dennis L. Wilson,et al.  Asymptotic Properties of Nearest Neighbor Rules Using Edited Data , 1972, IEEE Trans. Syst. Man Cybern..

[26]  Aytug Onan,et al.  A fuzzy-rough nearest neighbor classifier combined with consistency-based subset evaluation and instance selection for automated diagnosis of breast cancer , 2015, Expert Syst. Appl..