Incremental filter and wrapper approaches for feature discretization

Discrete data representations are necessary, or at least convenient, in many machine learning problems. While feature selection (FS) techniques aim to find relevant subsets of features, the goal of feature discretization (FD) is to find concise (quantized) data representations adequate for the learning task at hand. In this paper, we propose two incremental methods for FD. The first belongs to the filter family, in which the quality of the discretization is assessed by a (supervised or unsupervised) relevance criterion. The second is a wrapper, in which the discretized features are assessed by a classifier. Both methods can be coupled with any static (supervised or unsupervised) discretization procedure and can be used to perform FS as a pre-processing or post-processing stage. The proposed methods attain efficient representations, suitable for binary and multi-class problems with different types of data, and are competitive with existing methods. Moreover, applying well-known FS methods to the features discretized by our techniques yields better accuracy than applying them to features discretized by other methods or to the original features.
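
As a rough illustration of the incremental filter idea (a minimal sketch, not the exact procedure proposed in the paper), the code below grows the number of bins used to quantize each feature while a relevance criterion, here the mutual information between the quantized feature and the class label, keeps improving. The equal-width binning, the stopping tolerance `tol`, and the function names are assumptions made for illustration only.

```python
# Minimal sketch of an incremental filter-style feature discretization:
# for each feature, increase the number of equal-width bins while a
# relevance criterion (mutual information with the class label) improves.
# The stopping rule, criterion, and binning scheme are simplifying
# assumptions, not the paper's exact method.
import numpy as np
from sklearn.metrics import mutual_info_score


def discretize_feature(x, y, max_bins=32, tol=1e-3):
    """Return the quantized feature and the chosen number of bins."""
    best_bins, best_mi = 1, -np.inf
    best_q = np.zeros_like(x, dtype=int)
    for n_bins in range(2, max_bins + 1):
        # Equal-width binning stands in for any static discretizer.
        edges = np.linspace(x.min(), x.max(), n_bins + 1)
        q = np.digitize(x, edges[1:-1])
        mi = mutual_info_score(y, q)
        if mi - best_mi < tol:  # stop when relevance stops improving
            break
        best_bins, best_mi, best_q = n_bins, mi, q
    return best_q, best_bins


def incremental_filter_fd(X, y, **kwargs):
    """Discretize each column of X independently (filter approach)."""
    cols = [discretize_feature(X[:, j], y, **kwargs) for j in range(X.shape[1])]
    Q = np.column_stack([q for q, _ in cols])
    bins_per_feature = [b for _, b in cols]
    return Q, bins_per_feature
```

A wrapper variant would replace the mutual-information criterion with, for example, the cross-validated accuracy of a classifier trained on the discretized features, at a correspondingly higher computational cost.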
