Genetic algorithms in feature and instance selection

Feature selection and instance selection are two important data preprocessing steps in data mining: the former aims to remove irrelevant and/or redundant features from a given dataset, and the latter aims to discard faulty data. Genetic algorithms have been widely used for both tasks in related studies. However, the two tasks are generally considered separately in the literature, and it is unknown how classification performance differs when feature and instance selection are performed together versus when only one of them is performed. Therefore, the aim of this study is to perform feature selection and instance selection based on genetic algorithms, using different priorities, and to examine the resulting classification performance over datasets from different domains. The experimental results, obtained from four small- and large-scale datasets containing various numbers of features and data samples, show that performing both feature and instance selection usually makes the classifiers (i.e., support vector machines and k-nearest neighbor) perform slightly worse than performing feature selection or instance selection alone. However, while there is no significant difference in classification accuracy between these data preprocessing methods, combining feature and instance selection greatly reduces the computational effort of training the classifiers compared with performing feature selection or instance selection alone. Considering both classification effectiveness and efficiency, we demonstrate that performing feature selection first and instance selection second is the optimal solution for data preprocessing in data mining. Both the SVM and k-NN classifiers achieve classification accuracy similar to that of the baselines (i.e., those without data preprocessing). The decisions regarding which data preprocessing task to perform for different dataset scales are also discussed.
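The "feature selection first, instance selection second" ordering can be illustrated with a minimal sketch. The abstract does not give the GA settings or the fitness function, so everything below is an assumption made for illustration: the function names (`knn1_accuracy`, `ga_select`, `feature_then_instance`) are hypothetical, the fitness is 1-NN accuracy on a held-out set minus a small penalty on the fraction of bits kept, and the GA uses binary tournament selection, one-point crossover, bit-flip mutation, and elitism with arbitrary small parameter values.

```python
import random

def knn1_accuracy(train, valid, feats):
    """1-NN accuracy on `valid`, using only the feature indices in `feats`."""
    if not train or not feats:
        return 0.0
    hits = 0
    for xv, yv in valid:
        nearest = min(train, key=lambda s: sum((s[0][j] - xv[j]) ** 2 for j in feats))
        hits += nearest[1] == yv
    return hits / len(valid)

def ga_select(n_bits, fitness, rng, pop_size=12, generations=10, p_mut=0.05):
    """Generational GA over bit masks (assumed settings); returns the best mask."""
    pop = [[rng.random() < 0.5 for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def pick():
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        nxt = [best[:]]  # elitism: carry the best mask over unchanged
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n_bits) if n_bits > 1 else 0
            child = p1[:cut] + p2[cut:]  # one-point crossover
            child = [(not g) if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
        best = max(pop, key=fitness)
    return best

def feature_then_instance(train, valid, seed=0, penalty=0.01):
    """Stage 1 selects features; stage 2 selects instances on the reduced features."""
    rng = random.Random(seed)
    n_feat = len(train[0][0])

    def feature_fitness(mask):
        feats = [j for j, m in enumerate(mask) if m]
        return knn1_accuracy(train, valid, feats) - penalty * sum(mask) / n_feat

    fmask = ga_select(n_feat, feature_fitness, rng)
    feats = [j for j, m in enumerate(fmask) if m]

    def instance_fitness(mask):
        subset = [s for s, m in zip(train, mask) if m]
        return knn1_accuracy(subset, valid, feats) - penalty * sum(mask) / len(train)

    imask = ga_select(len(train), instance_fitness, rng)
    kept = [s for s, m in zip(train, imask) if m]
    return feats, kept

# Toy data (made up for this sketch): feature 0 carries the class signal,
# features 1-3 are uniform noise.
rng = random.Random(1)
data = []
for i in range(36):
    label = i % 2
    x = [label + rng.gauss(0, 0.2), rng.random(), rng.random(), rng.random()]
    data.append((x, label))
train, valid = data[:24], data[24:]
feats, kept = feature_then_instance(train, valid)
print(len(feats), "features and", len(kept), "of", len(train), "instances kept")
```

Running feature selection first means the second GA evaluates each candidate instance subset on fewer dimensions, which is one plausible reason this ordering is cheap to train on. Swapping the two stages inside `feature_then_instance` would give the opposite priority for comparison.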
