Prediction-based regularization using data augmented regression

The role of regularization is to control fitted model complexity and variance by penalizing (or constraining) models to lie in a region of model space that is deemed reasonable, thus facilitating good predictive performance. This is typically achieved by penalizing a parametric or non-parametric representation of the model. In this paper we advocate instead the use of prior knowledge or expectations about the predictions of models for regularization. This has the twofold advantage of allowing a more intuitive interpretation of penalties and priors, and of explicitly controlling model extrapolation into relevant regions of the feature space. This second point is especially critical in high-dimensional modeling situations, where the curse of dimensionality implies that new prediction points usually require extrapolation. We demonstrate that prediction-based regularization can, in many cases, be implemented stochastically by simply augmenting the dataset with Monte Carlo pseudo-data. We investigate the range of applicability of this implementation. An asymptotic analysis of the performance of Data Augmented Regression (DAR) in parametric and non-parametric linear regression, and in nearest neighbor regression, clarifies the regularizing behavior of DAR. We apply DAR to simulated and real data, and show that it is able to control the variance of extrapolation while maintaining, and often improving, predictive accuracy.
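
To make the data-augmentation idea concrete, the sketch below shows one way Monte Carlo pseudo-data could regularize an ordinary least squares fit toward prior expectations about its predictions. The sampling scheme, the constant prior mean, and the down-weighting parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dar_fit(X, y, n_pseudo=200, weight=0.1, prior_mean=None, rng=None):
    """Weighted least squares on data augmented with Monte Carlo pseudo-points.

    Pseudo-covariates are sampled uniformly over the observed feature ranges;
    pseudo-responses come from a prior guess about the predictions (here the
    sample mean of y unless a prior_mean function is supplied). The weight
    controls how strongly the fit is pulled toward the prior at the
    pseudo-points. This is a hedged sketch of the general idea, not the
    paper's specific DAR scheme.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape

    # Monte Carlo pseudo-covariates covering the region where extrapolation matters.
    lo, hi = X.min(axis=0), X.max(axis=0)
    X_pseudo = rng.uniform(lo, hi, size=(n_pseudo, p))

    # Prior expectation of the model's predictions at the pseudo-points.
    if prior_mean is None:
        y_pseudo = np.full(n_pseudo, y.mean())
    else:
        y_pseudo = prior_mean(X_pseudo)

    # Stack real and pseudo observations, down-weighting the pseudo rows.
    X_aug = np.vstack([X, X_pseudo])
    y_aug = np.concatenate([y, y_pseudo])
    w = np.concatenate([np.ones(n), np.full(n_pseudo, weight)])

    # Weighted least squares with an intercept column.
    A = np.column_stack([np.ones_like(y_aug), X_aug])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * y_aug, rcond=None)
    return beta
```

With weight set to zero this reduces to ordinary least squares; increasing it shrinks the fitted surface toward the prior predictions at the pseudo-points, which is how the augmentation controls the variance of extrapolation.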
