On the combination of evolutionary algorithms and stratified strategies for training set selection in data mining

In this paper, we present a new approach for training set selection in large data sets. The algorithm combines stratification with evolutionary algorithms. Stratification reduces the size of the domain over which selection is applied, while the evolutionary method selects the most representative instances. The performance of the proposal is compared with that of seven non-evolutionary algorithms, also executed in stratified form. The analysis follows two evaluation approaches: the balance between reduction and accuracy of the selected subsets, and the balance between interpretability and accuracy of the representation models associated with these subsets. The algorithms have been assessed on large and very large data sets. The study shows that stratified evolutionary instance selection consistently outperforms the non-evolutionary algorithms. Its main advantages are high instance reduction rates, high classification accuracy, and models with high interpretability.
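
The combination described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a random partition into disjoint strata, a plain generational genetic algorithm as a stand-in for the evolutionary method, and a fitness that weights leave-one-out 1-NN accuracy against reduction rate through a hypothetical parameter `alpha`. All function names and default values are illustrative.

```python
import random

def fitness(mask, stratum, alpha=0.5):
    """Assumed fitness: alpha * leave-one-out 1-NN accuracy + (1 - alpha) * reduction rate."""
    kept = [i for i, keep in enumerate(mask) if keep]
    if not kept:
        return 0.0  # an empty subset cannot classify anything
    hits = 0
    for i, (x, y) in enumerate(stratum):
        # Exclude the instance itself so it cannot trivially match.
        candidates = [j for j in kept if j != i]
        if not candidates:
            continue
        nearest = min(candidates,
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(stratum[j][0], x)))
        hits += stratum[nearest][1] == y
    accuracy = hits / len(stratum)
    reduction = 1.0 - len(kept) / len(stratum)
    return alpha * accuracy + (1 - alpha) * reduction

def evolve_selection(stratum, rng, pop_size=20, generations=30, p_mut=0.05):
    """Plain generational GA over bit masks (True = keep the instance)."""
    n = len(stratum)
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate each mask once per generation.
        scored = sorted(((fitness(m, stratum), m) for m in pop),
                        key=lambda t: t[0], reverse=True)
        next_pop = [m for _, m in scored[:2]]  # elitism: carry over the two best masks
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            a = max(rng.sample(scored, 3), key=lambda t: t[0])[1]
            b = max(rng.sample(scored, 3), key=lambda t: t[0])[1]
            # Uniform crossover followed by bit-flip mutation.
            child = [(x if rng.random() < 0.5 else y) != (rng.random() < p_mut)
                     for x, y in zip(a, b)]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=lambda m: fitness(m, stratum))
    return [inst for inst, keep in zip(stratum, best) if keep]

def stratified_selection(data, n_strata, seed=0):
    """Split data into disjoint strata, select within each, and join the results."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    strata = [shuffled[i::n_strata] for i in range(n_strata)]
    selected = []
    for stratum in strata:
        selected.extend(evolve_selection(stratum, rng))
    return selected

if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic two-class data, used only to exercise the sketch.
    data = ([((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(60)]
            + [((rng.gauss(5, 1), rng.gauss(5, 1)), 1) for _ in range(60)])
    subset = stratified_selection(data, n_strata=3)
    print(len(data), "->", len(subset))
```

Because each stratum is processed independently, the cost of every fitness evaluation depends on the stratum size rather than on the full data set, which is the scaling benefit the abstract attributes to stratification.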
