Conditional predictive inference post model selection

We give a finite-sample analysis of predictive inference procedures after model selection in regression with random design. The analysis focuses on a statistically challenging scenario where the number of potentially important explanatory variables can be infinite, where no regularity conditions are imposed on unknown parameters, where the number of explanatory variables in a "good" model can be of the same order as the sample size, and where the number of candidate models can be of larger order than the sample size. The performance of inference procedures is evaluated conditional on the training sample. Under weak conditions on only the number of candidate models and on their complexity, and uniformly over all data-generating processes under consideration, we show that a certain prediction interval is approximately valid and short with high probability in finite samples: its actual coverage probability is close to the nominal one, and its length is close to the length of an infeasible interval that is constructed with knowledge of the "best" candidate model. Similar results are shown to hold for predictive inference procedures other than prediction intervals, for example, tests of whether a future response will lie above or below a given threshold.
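
The two criteria, "approximately valid" and "short", can be written schematically as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: $T_n$ denotes the training sample, $(x_0, y_0)$ a future observation independent of $T_n$, $\hat{I}_n$ the prediction interval computed after model selection, $I_n^{*}$ the infeasible interval built from the "best" candidate model, $1-\alpha$ the nominal coverage level, and $\varepsilon, \delta > 0$ placeholder tolerances whose actual form the abstract does not state.

```latex
% Schematic only: \varepsilon and \delta are placeholders, not the paper's bounds.
\[
  \Bigl|\, \mathbb{P}\bigl( y_0 \in \hat{I}_n(x_0) \,\bigm|\, T_n \bigr) - (1-\alpha) \Bigr| \le \varepsilon
  \qquad\text{and}\qquad
  \Bigl|\, \operatorname{length}\bigl(\hat{I}_n\bigr) - \operatorname{length}\bigl(I_n^{*}\bigr) \Bigr| \le \varepsilon ,
\]
\[
  \text{both holding with probability at least } 1-\delta \text{ over the draw of } T_n,
  \text{ uniformly over the data-generating processes considered.}
\]
```

The coverage probability in the first display is conditional on $T_n$, so it is itself a random quantity. This is why the guarantee is stated as holding "with high probability" over the training sample instead of on average over it.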
