Discriminative Feature Selection by Nonparametric Bayes Error Minimization

Feature selection is fundamental to knowledge discovery from massive amounts of high-dimensional data. In an effort to establish theoretical justification for feature selection algorithms, this paper presents a theoretically optimal criterion, namely the discriminative optimal criterion (DoC), for feature selection. Compared with the existing representative optimal criterion (RoC), which retains maximum information for modeling the relationship between input and output variables, DoC is pragmatically advantageous because it attempts to directly maximize classification accuracy and naturally reflects the Bayes error in its objective. To make DoC computationally tractable for practical tasks, we propose an algorithmic framework that selects a subset of features by minimizing the Bayes error rate estimated by a nonparametric estimator. A set of existing algorithms, as well as new ones, can be derived naturally from this framework. As an example, we show that the Relief algorithm greedily attempts to minimize the Bayes error estimated by the k-nearest-neighbor (kNN) method. This new interpretation reveals the rationale behind the family of margin-based feature selection algorithms and also offers a principled way to establish new alternatives for performance improvement. In particular, by exploiting the proposed framework, we establish the Parzen-Relief (P-Relief) algorithm, based on the Parzen window estimator, and the MAP-Relief (M-Relief) algorithm, which integrates the label distribution into the max-margin objective to effectively handle imbalanced and multiclass data. Experiments on various benchmark data sets demonstrate the effectiveness of the proposed algorithms.
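The sketch below is meant only to make the abstract's central connection concrete: a classic Relief-style weight update on one hand, and a plug-in, nonparametric (Parzen-window) estimate of the Bayes error on a candidate feature subset on the other. It is an illustrative sketch under stated assumptions, not the paper's P-Relief or M-Relief implementation; the function names, the Gaussian kernel, and all defaults are assumptions made here for illustration.

```python
# Minimal sketch (not the authors' reference implementation) of two ingredients
# the abstract relates: (i) the classic Relief weight update and (ii) a
# Parzen-window plug-in estimate of the Bayes error for a feature subset.
# Function names, the Gaussian kernel, and defaults are illustrative assumptions.
import numpy as np

def relief_weights(X, y, n_iters=100, rng=None):
    """Classic Relief: reward features that separate nearest hit/miss pairs."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # feature ranges for normalization
    for _ in range(n_iters):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)        # L1 distance to all samples
        dists[i] = np.inf
        same = (y == y[i])
        same[i] = False
        hit = np.argmin(np.where(same, dists, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dists, np.inf))  # nearest other-class sample
        # Margin-style update: near-miss separation minus near-hit separation.
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return w / n_iters

def parzen_bayes_error(X, y, bandwidth=1.0):
    """Plug-in Bayes error estimate: mean_i (1 - max_c P_hat(c | x_i)),
    with class-conditional densities estimated by Gaussian Parzen windows."""
    n = X.shape[0]
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / n
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-sq / (2.0 * bandwidth ** 2))
    np.fill_diagonal(K, 0.0)                               # leave-one-out to reduce optimism
    post = np.stack([K[:, y == c].mean(axis=1) * p
                     for c, p in zip(classes, priors)], axis=1)
    post /= post.sum(axis=1, keepdims=True) + 1e-12        # normalize to posteriors
    return float(np.mean(1.0 - post.max(axis=1)))

# Toy usage: rank features with Relief, then score subsets by estimated Bayes error.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 200)
    X = np.c_[y + 0.3 * rng.standard_normal(200),      # one informative feature
              rng.standard_normal((200, 4))]           # four noise features
    w = relief_weights(X, y, rng=0)
    top = np.argsort(w)[::-1][:1]
    print("Relief weights:", np.round(w, 3))
    print("Bayes-error estimate, all features:", parzen_bayes_error(X, y))
    print("Bayes-error estimate, top feature :", parzen_bayes_error(X[:, top], y))
```

On this toy example, dropping the noise features should lower the estimated Bayes error, which illustrates the behavior the proposed framework exploits when scoring candidate feature subsets; the paper's algorithms fold such nonparametric estimates directly into the feature-weighting objective rather than evaluating subsets one by one.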

[1] Igor Kononenko, et al. Estimating Attributes: Analysis and Extensions of RELIEF, 1994, ECML.

[2] Hiroshi Mamitsuka, et al. Query-learning-based iterative feature-subset selection for learning from high-dimensional data sets, 2005, Knowledge and Information Systems.

[3] Michael I. Jordan, et al. Feature selection for high-dimensional genomic microarray data, 2001, ICML.

[4] Deniz Erdogmus, et al. Feature extraction using information-theoretic learning, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] Zheng Bao, et al. Large Margin Feature Weighting Method via Linear Programming, 2009, IEEE Transactions on Knowledge and Data Engineering.

[6] David W. Aha, et al. A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms, 1997, Artificial Intelligence Review.

[7] Marko Robnik-Sikonja, et al. Comprehensible Interpretation of Relief's Estimates, 2001, ICML.

[8] Larry A. Rendell, et al. A Practical Approach to Feature Selection, 1992, ML.

[9] Nuno Vasconcelos. Feature selection by maximum marginal diversity: optimality and implications for visual recognition, 2003, IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10] Shuang-Hong Yang, et al. Language pyramid and multi-scale text analysis, 2010, CIKM.

[11] Hans-Peter Kriegel, et al. Feature Weighting and Instance Selection for Collaborative Filtering: An Information-Theoretic Approach, 2003, Knowledge and Information Systems.

[12] Geoff Holmes, et al. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining, 2003, IEEE Trans. Knowl. Data Eng.

[13] Daphne Koller, et al. Toward Optimal Feature Selection, 1996, ICML.

[14] Yijun Sun, et al. Iterative RELIEF for Feature Weighting: Algorithms, Theories, and Applications, 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15] Marko Robnik-Sikonja, et al. Theoretical and Empirical Analysis of ReliefF and RReliefF, 2003, Machine Learning.

[16] Vladimir Vapnik. Statistical Learning Theory, 1998.

[17] Runze Li, et al. Statistical Challenges with High Dimensionality: Feature Selection in Knowledge Discovery, 2006, arXiv:math/0602133.

[18] Shaohua Kevin Zhou, et al. Variational Graph Embedding for Globally and Locally Consistent Feature Extraction, 2009, ECML/PKDD.

[19] Shuang-Hong Yang, et al. Feature Selection by Nonparametric Bayes Error Minimization, 2008, PAKDD.

[20] U. Feige, et al. Spectral Graph Theory, 2015.

[21] Gustavo Carneiro, et al. Minimum Bayes error features for visual recognition by sequential feature selection and extraction, 2005, The 2nd Canadian Conference on Computer and Robot Vision (CRV'05).

[22] Shigeo Abe. Pattern Classification, 2001, Springer London.

[23] Shuang-Hong Yang, et al. Efficient Feature Selection in the Presence of Outliers and Noises, 2008, AIRS.

[24] Philip S. Yu, et al. Top 10 algorithms in data mining, 2007, Knowledge and Information Systems.

[25] Huan Liu, et al. Feature Selection for Classification, 1997, Intell. Data Anal.

[26] Radford M. Neal. Pattern Recognition and Machine Learning, 2007, Technometrics.

[27] George Saon, et al. Minimum Bayes Error Feature Selection for Continuous Speech Recognition, 2000, NIPS.

[28] Cyrus Shahabi, et al. Feature subset selection and feature ranking for multivariate time series, 2005, IEEE Transactions on Knowledge and Data Engineering.

[29] Yiming Yang, et al. A Comparative Study on Feature Selection in Text Categorization, 1997, ICML.

[30] Naftali Tishby, et al. Margin based feature selection - theory and algorithms, 2004, ICML.

[31] Kari Torkkola, et al. Feature Extraction by Non-Parametric Mutual Information Maximization, 2003, J. Mach. Learn. Res.

[32] Anil K. Jain, et al. Feature Selection: Evaluation, Application, and Small Sample Performance, 1997, IEEE Trans. Pattern Anal. Mach. Intell.

[33] Huan Liu, et al. Efficient Feature Selection via Analysis of Relevance and Redundancy, 2004, J. Mach. Learn. Res.

[34] P. Bickel, et al. Simultaneous Analysis of Lasso and Dantzig Selector, 2008, arXiv:0801.1095.

[35] David Haussler, et al. Exploiting Generative Models in Discriminative Classifiers, 1998, NIPS.

[36] José Ranilla, et al. Introducing a family of linear measures for feature selection in text categorization, 2005, IEEE Transactions on Knowledge and Data Engineering.

[37] Bernhard Schölkopf, et al. Use of the Zero-Norm with Linear Models and Kernel Methods, 2003, J. Mach. Learn. Res.

[38] David R. Musicant, et al. Lagrangian Support Vector Machines, 2001, J. Mach. Learn. Res.

[39] Glenn Fung, et al. SVM Feature Selection for Classification of SPECT Images of Alzheimer's Disease Using Spatial Information, 2005, ICDM.

[40] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics), 2006.

[41] Yun Q. Shi, et al. Feature Selection based on the Bhattacharyya Distance, 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[42] Huan Liu, et al. Spectral feature selection for supervised and unsupervised learning, 2007, ICML '07.

[43] Keinosuke Fukunaga, et al. Bayes Error Estimation Using Parzen and k-NN Procedures, 1987, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44] Yiming Yang, et al. A re-examination of text categorization methods, 1999, SIGIR '99.

[45] Qingfeng Chen, et al. Discovery of Structural and Functional Features in RNA Pseudoknots, 2009, IEEE Transactions on Knowledge and Data Engineering.

[46] Salim Hariri, et al. A new dependency and correlation analysis for features, 2005, IEEE Transactions on Knowledge and Data Engineering.

[47] Chulhee Lee, et al. Feature extraction based on the Bhattacharyya distance, 2003, Pattern Recognit.

[48] Isabelle Guyon, et al. An Introduction to Variable and Feature Selection, 2003, J. Mach. Learn. Res.

[49] David G. Stork. Pattern Classification, 1973.

[50] Mikhail Belkin, et al. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, 2003, Neural Computation.