Adjusting Stepwise p-Values in Generalized Linear Models

Stepwise methods for variable selection are frequently used to determine the predictors of an outcome in generalized linear models. Although they are widely used within the scientific community, it is well known that tests on the explained deviance of the selected model are biased. The bias arises because the traditional test statistics on which these methods rely were designed for testing pre-specified hypotheses, whereas the tested model is selected through a data-driven procedure. A multiplicity problem therefore arises. In this work, we define and discuss a nonparametric procedure to adjust the p-value of the model selected by any stepwise selection method. The unbiasedness and consistency of the method are also proved. A simulation study shows the validity of the procedure. Theoretical differences from previous work in the same field are also discussed.
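The abstract does not spell out the adjustment, so the following is only a minimal sketch of one plausible nonparametric approach, not necessarily the procedure defined in the paper. It assumes a permutation scheme: the response is shuffled, the whole stepwise search is rerun on each shuffled copy, and the observed explained deviance is compared with the resulting null distribution. For brevity it uses a Gaussian model (where deviance reduces to the residual sum of squares) and AIC-based forward selection. All function names (`rss`, `aic`, `forward_select`, `explained_deviance`, `permutation_adjusted_p`) are hypothetical and introduced for illustration.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit: intercept plus columns `cols`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def aic(X, y, cols):
    """Gaussian AIC up to an additive constant (assumed selection criterion)."""
    n = len(y)
    return n * np.log(rss(X, y, cols) / n) + 2 * (len(cols) + 1)

def forward_select(X, y):
    """Greedy forward selection: add the best column while AIC keeps decreasing."""
    selected, remaining = [], list(range(X.shape[1]))
    best = aic(X, y, selected)
    while remaining:
        cand_aic, cand = min((aic(X, y, selected + [j]), j) for j in remaining)
        if cand_aic >= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_aic
    return selected

def explained_deviance(X, y):
    """Deviance explained by the stepwise-selected model relative to the null model."""
    return rss(X, y, []) - rss(X, y, forward_select(X, y))

def permutation_adjusted_p(X, y, n_perm=200, seed=0):
    """Adjusted p-value: the selection step is repeated inside every permutation."""
    rng = np.random.default_rng(seed)
    t_obs = explained_deviance(X, y)
    t_perm = np.array([explained_deviance(X, rng.permutation(y))
                       for _ in range(n_perm)])
    # The +1 terms count the observed statistic as one of the permutations.
    return (1 + np.sum(t_perm >= t_obs)) / (n_perm + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 5))
    y = 2.0 * X[:, 0] + rng.normal(size=60)   # only the first predictor matters
    print(permutation_adjusted_p(X, y))
```

Because the selection is rerun on every permuted response, the reference distribution reflects the data-driven search itself, which is the source of the multiplicity problem described above. Extending the sketch to a non-Gaussian generalized linear model would require replacing `rss` with the appropriate model deviance.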
