Assumption Lean Regression

Abstract. It is well known that with observational data, models used in conventional regression analyses are commonly misspecified. Yet in practice, one tends to proceed with interpretations and inferences that rely on correct specification. Even those who invoke Box’s maxim that all models are wrong proceed as if results were generally useful. Misspecification, however, has implications that affect practice. Regression models are approximations to a true response surface and should be treated as such. Accordingly, regression parameters should be interpreted as statistical functionals. Importantly, the regressor distribution affects targets of estimation and regressor randomness affects the sampling variability of estimates. As a consequence, inference should be based on sandwich estimators or the pairs (x–y) bootstrap. Traditional prediction intervals lose their pointwise coverage guarantees, but empirically calibrated intervals can be justified for future populations. We illustrate the key concepts with an empirical application.
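
As a rough illustration of the inference recommendation in the abstract, the sketch below fits a straight line to simulated data whose true response surface is quadratic, then computes standard errors two ways: with a sandwich estimator and with the pairs (x–y) bootstrap. The simulated data, the function names (`ols`, `sandwich_se`, `pairs_bootstrap_se`, `simulate`), the choice of the HC0 sandwich variant, and the number of bootstrap replicates are all assumptions made for illustration; none of them come from the paper.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for a design matrix X that includes an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def sandwich_se(X, y):
    """HC0 sandwich standard errors: (X'X)^-1 [X' diag(r^2) X] (X'X)^-1."""
    beta = ols(X, y)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = bread @ meat @ bread
    return np.sqrt(np.diag(cov))


def pairs_bootstrap_se(X, y, n_boot=2000, seed=0):
    """Pairs bootstrap: resample whole (x_i, y_i) rows with replacement and refit each time."""
    rng = np.random.default_rng(seed)
    n = len(y)
    betas = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        betas[b] = ols(X[idx], y[idx])
    return betas.std(axis=0, ddof=1)


def simulate(n=500, seed=1):
    """Hypothetical data: quadratic truth, random regressor, so a linear fit is misspecified."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 3.0, size=n)
    y = x ** 2 + rng.normal(scale=0.5, size=n)
    X = np.column_stack([np.ones(n), x])
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    print("coefficients:      ", ols(X, y))
    print("sandwich SE (HC0): ", sandwich_se(X, y))
    print("pairs bootstrap SE:", pairs_bootstrap_se(X, y))
```

Both procedures treat the regressors as random and make no use of a correctly specified mean function, which is why they remain usable when the fitted line is only an approximation. On a sample of this size the two sets of standard errors should be close to each other.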
