A scalable approach to simultaneous evolutionary instance and feature selection

An enormous amount of information is continually being produced in current research, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection and text mining, involve large or enormous datasets. These datasets pose serious problems for many data mining algorithms. One method to address very large datasets is data reduction. Among the most useful data reduction methods is simultaneous instance and feature selection. This method achieves a considerable reduction in the training data while maintaining, or even improving, the performance of the data-mining algorithm. However, it suffers from a high degree of scalability problems, even for medium-sized datasets. In this paper, we propose a new evolutionary simultaneous instance and feature selection algorithm that is scalable to millions of instances and thousands of features. This proposal is based on the divide-and-conquer principle combined with bookkeeping. The divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the entire dataset into memory. Using 50 medium-sized datasets, we will demonstrate our method's ability to match the results of state-of-the-art instance and feature selection methods while significantly reducing the time requirements. Using 13 very large datasets, we will demonstrate the scalability of our proposal to millions of instances and thousands of features.

[1]  Francisco Herrera,et al.  IFS-CoCo: Instance and feature selection based on cooperative coevolution with nearest neighbor rule , 2010, Pattern Recognit..

[2]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[3]  S. Holm A Simple Sequentially Rejective Multiple Test Procedure , 1979 .

[4]  Gunnar Rätsch,et al.  Support Vector Machines and Kernels for Computational Biology , 2008, PLoS Comput. Biol..

[5]  Huan Liu,et al.  Discretization: An Enabling Technique , 2002, Data Mining and Knowledge Discovery.

[6]  J. T. de Souza,et al.  A novel approach for integrating feature and instance selection , 2008, ICMLC 2008.

[7]  C. A. Murthy,et al.  Unsupervised Feature Selection Using Feature Similarity , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[8]  Larry J. Eshelman,et al.  The CHC Adaptive Search Algorithm: How to Have Safe Search When Engaging in Nontraditional Genetic Recombination , 1990, FOGA.

[9]  Nicolás García-Pedrajas,et al.  Nonlinear Boosting Projections for Ensemble Construction , 2007, J. Mach. Learn. Res..

[10]  Eduardo R. Hruschka,et al.  Towards improving cluster-based feature selection with a simplified silhouette filter , 2011, Inf. Sci..

[11]  Francisco Herrera,et al.  Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study , 2003, IEEE Trans. Evol. Comput..

[12]  R. Iman,et al.  Approximations of the critical region of the fbietkan statistic , 1980 .

[13]  Jörg Kindermann,et al.  Text Categorization with Support Vector Machines. How to Represent Texts in Input Space? , 2002, Machine Learning.

[14]  Nitesh V. Chawla,et al.  Learning Ensembles from Bites: A Scalable and Accurate Approach , 2004, J. Mach. Learn. Res..

[15]  Xin Yao,et al.  Evolving a cooperative population of neural networks by minimizing mutual information , 2001, Proceedings of the 2001 Congress on Evolutionary Computation (IEEE Cat. No.01TH8546).

[16]  Neal Leavitt,et al.  Data Mining for the Corporate Masses? , 2002, Computer.

[17]  Donald E. Brown,et al.  Fast generic selection of features for neural network classifiers , 1992, IEEE Trans. Neural Networks.

[18]  Chris Mellish,et al.  Advances in Instance Selection for Instance-Based Learning Algorithms , 2002, Data Mining and Knowledge Discovery.

[19]  Hiroshi Motoda,et al.  Feature Selection for Knowledge Discovery and Data Mining , 1998, The Springer International Series in Engineering and Computer Science.

[20]  O. J. Dunn Multiple Comparisons among Means , 1961 .

[21]  Shih-Fu Chang,et al.  Image Retrieval: Current Techniques, Promising Directions, and Open Issues , 1999, J. Vis. Commun. Image Represent..

[22]  Foster J. Provost,et al.  Increasing the Efficiency of Data Mining Algorithms with Breadth-First Marker Propagation , 1997, KDD.

[23]  Yuping Wang,et al.  An orthogonal genetic algorithm with quantization for global numerical optimization , 2001, IEEE Trans. Evol. Comput..

[24]  Salvatore J. Stolfo,et al.  Adaptive Intrusion Detection: A Data Mining Approach , 2000, Artificial Intelligence Review.

[25]  Pedro Larrañaga,et al.  Prototype Selection and Feature Subset Selection by Estimation of Distribution Algorithms. A Case Study in the Survival of Cirrhotic Patients Treated with TIPS , 2001, AIME.

[26]  Francisco Herrera,et al.  Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Richard Weber,et al.  Simultaneous feature selection and classification using kernel-penalized support vector machines , 2011, Inf. Sci..

[28]  Huan Liu,et al.  Feature selection for clustering - a filter solution , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[29]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[30]  Pat Langley,et al.  Selection of Relevant Features and Examples in Machine Learning , 1997, Artif. Intell..

[31]  Stan Matwin,et al.  Machine Learning for the Detection of Oil Spills in Satellite Radar Images , 1998, Machine Learning.

[32]  Hung-Ming Chen,et al.  Design of nearest neighbor classifiers: multi-objective approach , 2005, Int. J. Approx. Reason..

[33]  Nicolás García-Pedrajas,et al.  Democratic instance selection: A linear complexity instance selection algorithm based on classifier ensemble concepts , 2010, Artif. Intell..

[34]  Keinosuke Fukunaga,et al.  A Branch and Bound Algorithm for Feature Subset Selection , 1977, IEEE Transactions on Computers.

[35]  Francisco Herrera,et al.  Enhancing evolutionary instance selection algorithms by means of fuzzy rough set based feature selection , 2012, Inf. Sci..

[36]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[37]  Richard Nock,et al.  A hybrid filter/wrapper approach of feature selection using information theory , 2002, Pattern Recognit..

[38]  Nicolás García-Pedrajas,et al.  Scaling up data mining algorithms: review and taxonomy , 2012, Progress in Artificial Intelligence.

[39]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[40]  Lakhmi C. Jain,et al.  Nearest neighbor classifier: Simultaneous editing and feature selection , 1999, Pattern Recognit. Lett..

[41]  Ludmila I. Kuncheva,et al.  Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy , 2003, Machine Learning.

[42]  Francisco Herrera,et al.  Stratified prototype selection based on a steady-state memetic algorithm: a study of scalability , 2010, Memetic Comput..

[43]  Foster J. Provost,et al.  A Survey of Methods for Scaling Up Inductive Algorithms , 1999, Data Mining and Knowledge Discovery.

[44]  Nicolás García-Pedrajas,et al.  A divide-and-conquer recursive approach for scaling up instance selection algorithms , 2009, Data Mining and Knowledge Discovery.

[45]  Francisco Herrera,et al.  Stratification for scaling up evolutionary prototype selection , 2005, Pattern Recognit. Lett..

[46]  Marko Robnik-Sikonja,et al.  Theoretical and Empirical Analysis of ReliefF and RReliefF , 2003, Machine Learning.

[47]  David W. Aha,et al.  Instance-Based Learning Algorithms , 1991, Machine Learning.

[48]  Carla E. Brodley,et al.  Recursive automatic bias selection for classifier construction , 1995, Machine Learning.

[49]  Jesús S. Aguilar-Ruiz,et al.  Incremental wrapper-based gene selection from microarray data for cancer classification , 2006, Pattern Recognit..

[50]  Marc Boullé,et al.  A Parameter-Free Classification Method for Large Scale Learning , 2009, J. Mach. Learn. Res..

[51]  Francisco Herrera,et al.  A memetic algorithm for evolutionary prototype selection: A scaling up approach , 2008, Pattern Recognit..

[52]  Marco Laumanns,et al.  Performance assessment of multiobjective optimizers: an analysis and review , 2003, IEEE Trans. Evol. Comput..

[53]  Michael I. Jordan,et al.  Feature selection for high-dimensional genomic microarray data , 2001, ICML.

[54]  V. Alarcon-Aquino,et al.  Instance Selection and Feature Weighting Using Evolutionary Algorithms , 2006, 2006 15th International Conference on Computing.

[55]  Huan Liu,et al.  On Issues of Instance Selection , 2002, Data Mining and Knowledge Discovery.

[56]  Juan José Rodríguez Diez,et al.  Disturbing Neighbors Diversity for Decision Forests , 2009, Applications of Supervised and Unsupervised Ensemble Methods.

[57]  Yvan Saeys,et al.  Translation initiation site prediction on a genomic scale: beauty in simplicity , 2007, ISMB/ECCB.

[58]  Huan Liu,et al.  Customer Retention via Data Mining , 2000, Artificial Intelligence Review.

[59]  M. Friedman A Comparison of Alternative Tests of Significance for the Problem of $m$ Rankings , 1940 .

[60]  Nicolás García-Pedrajas,et al.  A cooperative coevolutionary algorithm for instance selection for instance-based learning , 2010, Machine Learning.

[61]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .