Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination

Predictive models benefit from a compact, non-redundant subset of features that improves interpretability and generalization. Modern data sets are wide, dirty, contain a mix of numerical and categorical predictors, and may harbor interaction effects that require complex models. This poses a challenge for filter, wrapper, and embedded feature selection methods alike. We describe an algorithm that uses tree-based ensembles to generate a compact subset of non-redundant features. Parallel and serial ensembles of trees are combined into a mixed method that can uncover masking and detect features of secondary effect. Simulated and real examples illustrate the effectiveness of the approach.
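
The core idea of ranking features against artificial variables can be illustrated with a short sketch. The snippet below is a minimal simplification, not the authors' exact procedure: it assumes scikit-learn is available, uses a single parallel ensemble (a random forest) in place of the combined parallel/serial scheme, and omits the redundancy-elimination step. Each original column is paired with a permuted "artificial contrast" copy, and a feature is retained only if its ensemble importance repeatedly exceeds a high quantile of the contrast importances. The function and parameter names (`select_against_contrasts`, `n_rounds`, `quantile`) are illustrative, not from the paper.

```python
# Minimal sketch of feature ranking against artificial contrasts.
# Assumes scikit-learn; simplified relative to the paper's algorithm.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def select_against_contrasts(X, y, n_rounds=10, quantile=0.95, random_state=0):
    """Keep features whose importance beats the `quantile` of importances
    earned by permuted (artificial contrast) columns in most rounds."""
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    votes = np.zeros(n_features)
    for r in range(n_rounds):
        # Artificial contrasts: each column permuted, so the contrasts keep the
        # original marginal distributions but carry no information about y.
        contrasts = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        X_aug = np.hstack([X, contrasts])
        forest = RandomForestClassifier(n_estimators=200, random_state=r)
        forest.fit(X_aug, y)
        imp = forest.feature_importances_
        threshold = np.quantile(imp[n_features:], quantile)  # contrast importances
        votes += imp[:n_features] > threshold
    # Retain features that beat the contrast threshold in a majority of rounds.
    return np.where(votes > n_rounds / 2)[0]


if __name__ == "__main__":
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=5, random_state=1)
    print("selected feature indices:", select_against_contrasts(X, y))
```

Repeating the comparison over several rounds with fresh permutations is what makes the cutoff statistical rather than ad hoc: a feature must consistently outperform noise-level importances to be kept, which is the role the artificial variables play in the full method.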
