Stratified prototype selection based on a steady-state memetic algorithm: a study of scalability

Prototype selection (PS) is a data reduction process that refines the training set of a data mining algorithm. Running PS on existing datasets can be inefficient, especially as the size of the problem increases. In recent years, however, techniques have been developed to overcome the lack of scalability of the classical PS approaches. One of these techniques is stratification. In this study, we test the combination of stratification with a previously published steady-state memetic algorithm for PS on various problems, ranging from 50,000 to more than 1 million instances. We compare it with several well-known PS methods and study in depth the effects of stratification on the behavior of the selected method, focusing on its time complexity, accuracy, and convergence capabilities. Furthermore, we analyze the trade-off between accuracy and efficiency of the proposed combination and conclude that it is a very suitable option for PS tasks when the size of the problem exceeds the capabilities of the classical PS methods.
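
The abstract does not spell out the stratification procedure, so the following is only a minimal sketch under stated assumptions: the training set is split into disjoint strata that roughly preserve the class distribution, a PS method is run independently on each stratum, and the selected subsets are merged. All function names (`make_strata`, `cnn_select`, `stratified_ps`) are hypothetical, and a simple condensed nearest neighbor rule stands in for the steady-state memetic algorithm, which is not reproduced here.

```python
import random
from collections import defaultdict


def make_strata(labels, n_strata, seed=0):
    """Split instance indices into disjoint strata, dealing each class
    round-robin so class proportions are roughly preserved (assumed scheme)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    strata = [[] for _ in range(n_strata)]
    offset = 0  # continue dealing where the previous class stopped
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        for k, idx in enumerate(idxs):
            strata[(offset + k) % n_strata].append(idx)
        offset += len(idxs)
    return strata


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cnn_select(X, y, idxs):
    """Stand-in PS method: condensed nearest neighbor rule.
    This is NOT the memetic algorithm studied in the paper."""
    if not idxs:
        return []
    selected = [idxs[0]]
    in_selected = {idxs[0]}
    changed = True
    while changed:
        changed = False
        for i in idxs:
            if i in in_selected:
                continue
            nearest = min(selected, key=lambda j: sq_dist(X[i], X[j]))
            if y[nearest] != y[i]:  # misclassified by current subset: keep it
                selected.append(i)
                in_selected.add(i)
                changed = True
    return selected


def stratified_ps(X, y, n_strata, select=cnn_select, seed=0):
    """Run the PS method on each stratum and merge the selected indices."""
    chosen = []
    for stratum in make_strata(y, n_strata, seed):
        chosen.extend(select(X, y, stratum))
    return sorted(chosen)


if __name__ == "__main__":
    # Toy data: two well-separated classes on a line.
    X = [(float(i), 0.0) for i in range(20)]
    X += [(float(i) + 100.0, 0.0) for i in range(20)]
    y = [0] * 20 + [1] * 20
    print(stratified_ps(X, y, n_strata=4))
```

Because each stratum is processed on its own, the PS method only ever sees a fraction of the instances at a time, which is the property a scalability study of this kind would rely on. Any PS method with the same `(X, y, idxs)` signature could be passed as `select`.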
