Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data

Data and knowledge management systems employ feature selection algorithms to remove irrelevant, redundant, and noisy information from the data. There are two well-known approaches to feature selection: feature ranking (FR) and feature subset selection (FSS). In this paper, we propose a new FR algorithm, termed class-dependent density-based feature elimination (CDFE), for binary data sets. Our theoretical analysis shows that CDFE computes the weights used for feature ranking more efficiently than the mutual information measure, while the rankings obtained from the two criteria closely approximate each other. CDFE uses a filtrapper approach to select the final subset. For data sets with hundreds of thousands of features, feature selection with FR algorithms is simple and computationally efficient, but redundant information may not be removed. FSS algorithms, on the other hand, analyze the data for redundancies but may become computationally impractical on high-dimensional data sets. We address these problems by combining FR and FSS methods into a two-stage feature selection algorithm. When introduced as a preprocessing step to FSS algorithms, CDFE not only presents them with a feature subset that is good in terms of classification but also relieves them of heavy computation. Two FSS algorithms are employed in the second stage to test the two-stage feature selection idea. We carry out experiments with two different classifiers (naive Bayes and kernel ridge regression) on three real-life data sets (NOVA, HIVA, and GINA) from the "Agnostic Learning versus Prior Knowledge" challenge. As a stand-alone method, CDFE achieves up to about 92 percent reduction in the feature set size. When combined with the FSS algorithms in two stages, CDFE significantly improves their classification accuracy and achieves up to 97 percent reduction in the feature set size. We also compared CDFE against the winning entries of the challenge and found that it outperforms the best results on NOVA and HIVA while ranking third on GINA.
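
The abstract outlines a two-stage pipeline: a fast ranking stage that scores each binary feature against the class label and a filtrapper step that picks the cutoff, followed by an FSS stage that removes redundancy from the surviving features. The sketch below is a minimal illustration of that pipeline under stated assumptions: the exact CDFE weight formula is not given here, so the mutual-information criterion that CDFE is said to approximate stands in for the Stage-1 score, scikit-learn's BernoulliNB stands in for the wrapper classifier, and the candidate subset sizes are purely illustrative.

```python
# Hypothetical sketch of the two-stage FR + filtrapper pipeline; not the authors' code.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

def binary_mutual_information(x, y):
    """Empirical mutual information (in nats) between a binary feature x and a binary label y."""
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

def stage1_rank(X, y):
    """Stage 1 (FR): score every feature against the label and sort by decreasing relevance.
    Mutual information is used here as a stand-in for the CDFE weight."""
    scores = np.array([binary_mutual_information(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1]

def filtrapper_select(X, y, order, candidate_sizes=(100, 500, 1000, 2000)):
    """Filtrapper step: evaluate nested top-k subsets with a classifier and keep the
    smallest subset achieving the best cross-validated accuracy."""
    best_k, best_acc = None, -np.inf
    for k in candidate_sizes:
        k = min(k, X.shape[1])
        acc = cross_val_score(BernoulliNB(), X[:, order[:k]], y, cv=3).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    return order[:best_k]

# Usage (Stage 2 not shown): the selected subset would then be handed to an FSS
# algorithm for redundancy analysis on a much smaller feature set.
# selected = filtrapper_select(X_train, y_train, stage1_rank(X_train, y_train))
```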
