Does data splitting improve prediction?

Data splitting divides the data into two parts. One part is reserved for model selection. In some applications the second part is used for model validation, but we use it to estimate the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observations, and we judge predictive performance using log scoring. We compare the full data strategy, in which all the data are used for both selection and estimation, with the data splitting strategy. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the two strategies in four simulation scenarios. We also introduce a hybrid estimator that uses one part for model selection but both parts for estimation. We argue that, with some exceptions, a split data analysis is preferred to a full data analysis for prediction.
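The three strategies named above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the Gaussian linear model, the nested candidate models, BIC as the selection rule, the even random split, the plug-in normal predictive distribution, and the sign convention for the log score (mean negative log predictive density, so lower is better). The function names are hypothetical.

```python
import numpy as np

def fit_gaussian_lm(X, y):
    """Least-squares fit; returns coefficients and the MLE of the error s.d."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma = np.sqrt(np.mean((y - X @ beta) ** 2))
    return beta, sigma

def bic(X, y):
    """BIC of a Gaussian linear model (assumed selection rule)."""
    n, p = X.shape
    _, sigma = fit_gaussian_lm(X, y)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma**2) + 1)
    return -2 * loglik + (p + 1) * np.log(n)

def select_model(X, y, candidates):
    """Pick the candidate column set with the smallest BIC."""
    return min(candidates, key=lambda cols: bic(X[:, cols], y))

def log_score(X_new, y_new, cols, beta, sigma):
    """Mean negative log density of a plug-in normal predictive distribution."""
    mu = X_new[:, cols] @ beta
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (y_new - mu) ** 2 / (2 * sigma**2))

def compare_strategies(X, y, X_new, y_new, candidates, rng):
    """Score the full data, split data and hybrid strategies on new data."""
    perm = rng.permutation(len(y))
    sel, est = perm[: len(y) // 2], perm[len(y) // 2:]
    cols_full = select_model(X, y, candidates)             # select on all data
    cols_split = select_model(X[sel], y[sel], candidates)  # select on one part
    fits = {
        "full": (cols_full, X, y),             # estimate on all data
        "split": (cols_split, X[est], y[est]), # estimate on the other part
        "hybrid": (cols_split, X, y),          # estimate on both parts
    }
    scores = {}
    for name, (cols, X_fit, y_fit) in fits.items():
        beta, sigma = fit_gaussian_lm(X_fit[:, cols], y_fit)
        scores[name] = log_score(X_new, y_new, cols, beta, sigma)
    return scores

# Toy simulation (assumed setup): intercept plus three predictors, one active.
rng = np.random.default_rng(0)

def simulate(m):
    X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
    y = 1.0 + 0.5 * X[:, 1] + rng.normal(size=m)
    return X, y

candidates = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
X, y = simulate(60)
X_new, y_new = simulate(1000)
scores = compare_strategies(X, y, X_new, y_new, candidates, rng)
print(scores)
```

A single run like this says little on its own; averaging the scores over many simulated data sets would be needed before comparing the strategies.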
