Feature subset selection Filter-Wrapper based on low quality data

Feature selection is an active research area in machine learning. Its main idea is to choose a subset of the available features by eliminating features with little or no predictive information, as well as redundant features that are strongly correlated. Many feature selection approaches exist, but most of them work only with crisp data, and few can work directly with both crisp and low quality (imprecise and uncertain) data. We therefore propose a new feature selection method that handles both crisp and low quality data. The proposed approach is based on a Fuzzy Random Forest and integrates filter and wrapper methods into a sequential search procedure that improves the classification accuracy obtained with the selected features. It consists of the following main steps: (1) scaling and discretization of the feature set, with feature pre-selection using the discretization process (filter); (2) ranking of the pre-selected features using the Fuzzy Decision Trees of a Fuzzy Random Forest ensemble; and (3) wrapper feature selection using a Fuzzy Random Forest ensemble based on cross-validation. The efficiency and effectiveness of the approach are shown through several experiments using both high dimensional and low quality datasets. The approach performs well, in terms of both classification accuracy and the number of features selected, and behaves well both with high dimensional datasets (microarray datasets) and with low quality datasets.
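The three-step structure described in the abstract (filter, ranking, wrapper) can be illustrated with a minimal sketch. This is not the authors' method: a Fuzzy Random Forest is not available in the Python standard library, so the sketch swaps in simple stand-ins, all of which are assumptions made for illustration. The filter keeps a feature only if a single crisp cut point yields some information gain, the ranking reuses that gain, and the wrapper evaluates candidate subsets with a cross-validated nearest-centroid classifier. The data are synthetic and crisp, and every function name and threshold (`make_data`, `best_cut_gain`, `min_gain`, and so on) is hypothetical.

```python
import math
import random


def make_data(n=120, seed=0):
    """Synthetic crisp data: features 0 and 1 are informative, 2 is noise, 3 is constant."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([label * 2.0 + rng.gauss(0, 0.5),
                  -label * 1.5 + rng.gauss(0, 0.5),
                  rng.gauss(0, 1.0),
                  1.0])
        y.append(label)
    return X, y


def min_max_scale(X):
    """Scale every feature to [0, 1]; constant features become 0.0."""
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in X]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))


def best_cut_gain(X, y, j, n_cuts=9):
    """Information gain of the best single cut point on scaled feature j."""
    base, best = entropy(y), 0.0
    for k in range(1, n_cuts + 1):
        cut = k / (n_cuts + 1)
        left = [c for row, c in zip(X, y) if row[j] <= cut]
        right = [c for row, c in zip(X, y) if row[j] > cut]
        split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        best = max(best, base - split)
    return best


def cv_accuracy(X, y, features, k=5):
    """k-fold cross-validated accuracy of a nearest-centroid classifier."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        centroids = {}
        for c in set(y):
            rows = [X[i] for i in train if y[i] == c]
            centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in features]
        for i in test:
            def dist(c):
                return sum((X[i][j] - m) ** 2 for j, m in zip(features, centroids[c]))
            if min(centroids, key=dist) == y[i]:
                correct += 1
    return correct / len(X)


def select_features(X, y, min_gain=0.05):
    """Filter, rank, then wrap: returns (preselected, ranked, selected, accuracy)."""
    Xs = min_max_scale(X)
    gains = {j: best_cut_gain(Xs, y, j) for j in range(len(Xs[0]))}
    # Step 1 (filter): drop features that no cut point can discretize usefully.
    preselected = [j for j, g in gains.items() if g > min_gain]
    # Step 2 (ranking): order the pre-selected features, best first.
    ranked = sorted(preselected, key=lambda j: gains[j], reverse=True)
    # Step 3 (wrapper): add features in rank order while CV accuracy improves.
    selected, best_acc = [], 0.0
    for j in ranked:
        acc = cv_accuracy(Xs, y, selected + [j])
        if acc > best_acc:
            selected, best_acc = selected + [j], acc
    return preselected, ranked, selected, best_acc


if __name__ == "__main__":
    X, y = make_data()
    preselected, ranked, selected, acc = select_features(X, y)
    print("pre-selected:", preselected)
    print("ranked:", ranked)
    print("selected:", selected, "CV accuracy:", round(acc, 3))
```

The sketch handles only crisp values. Supporting low quality data (for example interval or fuzzy values) would require the fuzzy discretization and Fuzzy Random Forest components that the abstract refers to, which are outside the scope of this illustration.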
