Paper information - Variable Selection for Classification and Regression in Large p, Small n Problems

Classification and regression problems in which the number of predictor variables is larger than the number of observations are increasingly common with rapid technological advances in data collection. Because some of these variables may have little or no influence on the response, methods that can identify the unimportant variables are needed. Two methods that have been proposed for this purpose are EARTH and Random forest (RF). This article presents an alternative method, derived from the GUIDE classification and regression tree algorithm, that employs recursive partitioning to determine the degree of importance of the variables. Simulation experiments show that the new method improves the prediction accuracy of several nonparametric regression models more than Random forest and EARTH. The results indicate that it is not essential to correctly identify all the important variables in every situation. Conditions for which this occurs are obtained for the linear model. The article concludes with an application of the new method to identify rare molecules in a large genomic data set.

Wei-Yin Loh
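The "large p, small n" setting in the abstract can be made concrete with a small simulation. The sketch below is not the GUIDE-based method of the paper, nor EARTH or Random forest. It is a much simpler stand-in that scores each predictor by the largest fraction of the response's sum of squares removed by a single split on that predictor (a depth-one regression tree). All function names, the simulated model, and the sample sizes are assumptions chosen for illustration.

```python
import random


def stump_score(x, y):
    """Fraction of the total sum of squares of y removed by the best
    single split on x (0.0 if x is constant or y has no variation)."""
    n = len(y)
    order = sorted(range(n), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    total = sum(ys)
    sse_root = sum(v * v for v in ys) - total * total / n
    if sse_root <= 0:
        return 0.0
    best = 0.0
    left = 0.0
    for k in range(1, n):
        left += ys[k - 1]
        if xs[k] == xs[k - 1]:
            continue  # cannot split between tied x values
        right = total - left
        # Between-group sum of squares equals the reduction in SSE.
        gain = left * left / k + right * right / (n - k) - total * total / n
        best = max(best, gain)
    return best / sse_root


def rank_variables(X, y):
    """Return (indices sorted by decreasing score, list of scores)."""
    p = len(X[0])
    scores = [stump_score([row[j] for row in X], y) for j in range(p)]
    ranking = sorted(range(p), key=lambda j: -scores[j])
    return ranking, scores


def simulate(n=40, p=200, seed=0):
    """Hypothetical data: p > n, only the first two predictors matter."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.5) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    ranking, scores = rank_variables(X, y)
    for j in ranking[:5]:
        print(f"x{j}: score = {scores[j]:.3f}")
```

Because each score is maximized over many candidate splits, pure-noise predictors can still obtain sizeable scores when n is small. A practical procedure therefore needs a threshold or bias correction to separate important from unimportant variables, which this sketch does not attempt.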
