Gradient boosted feature selection

A feature selection algorithm should ideally satisfy four conditions: reliably extract relevant features; be able to identify non-linear feature interactions; scale linearly with the number of features and dimensions; and allow the incorporation of known sparsity structure. In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements. The algorithm is flexible, scalable, and surprisingly straightforward to implement, as it is based on a modification of gradient boosted trees. We evaluate GBFS on several real-world data sets and show that it matches or outperforms other state-of-the-art feature selection algorithms. Moreover, it scales to larger data sets and naturally allows for the incorporation of domain-specific side information.
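The key modification to gradient boosted trees is to make split selection feature-aware: splitting on a feature the ensemble has already selected is free, while splitting on a new feature incurs a penalty, so boosting prefers to reuse features it has already paid for. The sketch below is a hypothetical, simplified rendition of that idea in Python with scikit-learn, not the authors' implementation. It assumes squared loss, grows one shallow tree per round, considers at most one new feature per round via brute-force candidate search (for clarity, not efficiency), and charges a flat penalty `lam` for each feature opened for the first time; the function name `gbfs_sketch` and all parameter defaults are invented for illustration.

```python
# A minimal sketch of the GBFS idea, NOT the authors' implementation.
# Assumptions: squared loss, one new candidate feature per boosting
# round, and a flat per-feature penalty `lam` for newly opened features.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbfs_sketch(X, y, n_rounds=50, lam=0.05, lr=0.1, depth=3):
    """Stage-wise boosting that charges `lam` the first time a round
    splits on a feature outside the currently selected set."""
    n, d = X.shape
    F = np.zeros(n)                      # current additive ensemble
    selected = set()
    for _ in range(n_rounds):
        residual = y - F                 # negative gradient of squared loss
        # candidate feature sets: current set, or current set plus one new feature
        trials = ([sorted(selected)] if selected else []) + \
                 [sorted(selected | {j}) for j in range(d) if j not in selected]
        best = None
        for feats in trials:
            tree = DecisionTreeRegressor(max_depth=depth).fit(X[:, feats], residual)
            pred = tree.predict(X[:, feats])
            # loss reduction achieved by this candidate tree
            gain = np.mean(residual ** 2) - np.mean((residual - lr * pred) ** 2)
            # features the tree actually split on (negative values mark leaves)
            used = {feats[i] for i in tree.tree_.feature if i >= 0}
            score = gain - lam * len(used - selected)
            if best is None or score > best[0]:
                best = (score, used, lr * pred)
        score, used, update = best
        if score <= 0:                   # no candidate pays for its penalty
            break
        selected |= used
        F += update
    return sorted(selected), F
```

Calling `gbfs_sketch(X, y, lam=0.05)` on a standardized data matrix returns the indices of the selected features together with the ensemble's training-set predictions; raising `lam` trades accuracy for a smaller feature set, which is the knob GBFS exposes for controlling sparsity.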
