Streamwise Feature Selection

In streamwise feature selection, new features are sequentially considered for addition to a predictive model. When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance. Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In contrast to traditional forward feature selection algorithms such as stepwise regression, in which all possible features are evaluated at each step and the best one is selected, streamwise feature selection evaluates each feature only once, when it is generated. We describe information-investing and α-investing, two adaptive complexity-penalty methods for streamwise feature selection that dynamically adjust the threshold on the error reduction required for adding a new feature. Both methods give false discovery rate (FDR)-style guarantees against overfitting. They differ from standard penalty methods such as AIC, BIC, and RIC, which drastically over- or under-fit in the limit of an infinite number of non-predictive features. Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and that it allows feature selection on problems with millions of potential features.
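To make the α-investing idea concrete, below is a minimal Python sketch, not the authors' implementation. It assumes the commonly described wealth updates: a threshold α_i = w/(2i) is spent on the i-th candidate, wealth grows by a payout α_Δ when a feature is admitted and shrinks by α_i/(1 − α_i) when it is not, and the p-value comes from an approximate chi-squared test on the drop in residual sum of squares of a linear model. All function and variable names here are illustrative.

```python
import numpy as np
from scipy import stats


def alpha_investing(X, y, w0=0.5, alpha_delta=0.5):
    """Sketch of alpha-investing feature selection over the columns of X.

    Assumed rule: spend alpha_i = wealth / (2i) on candidate i, earn
    alpha_delta when a feature is added, pay alpha_i / (1 - alpha_i) when
    it is rejected. The test statistic is an approximate likelihood-ratio
    (chi-squared, 1 df) on the reduction in residual sum of squares.
    """
    n = len(y)
    selected = []                       # indices of accepted features
    wealth = w0                         # current alpha-wealth
    rss = max(float(((y - y.mean()) ** 2).sum()), 1e-12)

    for i in range(X.shape[1]):
        alpha_i = wealth / (2 * (i + 1))            # fraction of wealth to spend
        cols = selected + [i]
        Z = np.column_stack([np.ones(n), X[:, cols]])
        beta, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)
        new_rss = max(float(((y - Z @ beta) ** 2).sum()), 1e-12)

        # Approximate chi-squared test on the RSS drop from adding feature i.
        delta = max(n * np.log(rss / new_rss), 0.0)
        p_value = stats.chi2.sf(delta, df=1)

        if p_value <= alpha_i:
            selected.append(i)
            rss = new_rss
            wealth += alpha_delta                   # payout for a discovery
        else:
            wealth -= alpha_i / (1 - alpha_i)       # cost of a failed test
        if wealth <= 0:
            break
    return selected
```

A small usage example under the same assumptions: with a few informative columns hidden among many spurious ones, the procedure typically admits only the informative columns, since the shrinking wealth caps how many false discoveries it can afford.

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))          # mostly non-predictive features
y = 2 * X[:, 3] - X[:, 7] + rng.normal(size=200)
print(alpha_investing(X, y))              # typically recovers columns 3 and 7
```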
