Consistent Feature Selection for Pattern Recognition in Polynomial Time

We analyze two feature selection problems: finding a minimal feature set that is optimal for classification (MINIMAL-OPTIMAL) vs. finding all features relevant to the target variable (ALL-RELEVANT). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL-RELEVANT is much harder than MINIMAL-OPTIMAL, and we propose two consistent, polynomial-time algorithms for solving it. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks.
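
To make the two problems concrete, here is a minimal sketch, not the paper's algorithms: it contrasts a backward-elimination estimate of the target's Markov boundary (MINIMAL-OPTIMAL) with a brute-force search for Kohavi-John-relevant features (ALL-RELEVANT). A Fisher-z partial-correlation test, exact only for jointly Gaussian data, stands in for the generic conditional-independence oracle, and all names (ci_test, minimal_optimal, all_relevant, max_cond) and the toy data are illustrative assumptions.

    import numpy as np
    from math import erf, log, sqrt
    from itertools import combinations

    def ci_test(data, i, j, cond=(), alpha=0.01):
        """True if column i is independent of column j given columns in cond,
        by a Fisher-z partial-correlation test (exact only for Gaussian data;
        it stands in for a generic conditional-independence oracle)."""
        n = data.shape[0]
        sub = data[:, [i, j] + list(cond)]
        prec = np.linalg.pinv(np.cov(sub, rowvar=False))
        r = -prec[0, 1] / sqrt(prec[0, 0] * prec[1, 1])
        r = max(min(r, 0.999999), -0.999999)          # guard against |r| = 1
        z = 0.5 * log((1 + r) / (1 - r)) * sqrt(n - len(cond) - 3)
        p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return p > alpha

    def minimal_optimal(data, target, features, alpha=0.01):
        """Backward elimination toward the target's Markov boundary: drop any
        feature that is independent of the target given the remaining ones."""
        S = list(features)
        changed = True
        while changed:
            changed = False
            for f in list(S):
                rest = tuple(g for g in S if g != f)
                if ci_test(data, target, f, rest, alpha):
                    S.remove(f)
                    changed = True
        return S

    def all_relevant(data, target, features, max_cond=2, alpha=0.01):
        """Brute-force Kohavi-John relevance: a feature is relevant if it is
        dependent on the target given SOME conditioning set. This search is
        exponential (truncated here by max_cond), unlike the polynomial-time
        algorithms the paper proposes; it is purely an illustration."""
        relevant = []
        for f in features:
            others = [g for g in features if g != f]
            for k in range(min(max_cond, len(others)) + 1):
                if any(not ci_test(data, target, f, Z, alpha)
                       for Z in combinations(others, k)):
                    relevant.append(f)
                    break
        return relevant

    # Toy chain X1 -> X0 -> Y plus pure noise X2: X1 is relevant to Y but
    # redundant once X0 is known, so the two problems give different answers.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=2000)
    x0 = x1 + 0.5 * rng.normal(size=2000)
    y = x0 + 0.5 * rng.normal(size=2000)
    x2 = rng.normal(size=2000)
    data = np.column_stack([y, x0, x1, x2])     # column 0 is the target
    print(minimal_optimal(data, 0, [1, 2, 3]))  # expected: [1]    (X0 only)
    print(all_relevant(data, 0, [1, 2, 3]))     # expected: [1, 2] (X0 and X1)

The gap between the two outputs is the point of the distinction: the minimal set sufficient for prediction excludes X1, while the set of all relevant features includes it, which matters in applications such as gene expression analysis where every influence on the target is of interest.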
