Following the econometric literature on model misspecification, we examine statistical inference for linear regression coefficients β_j when the predictors are random and the linear model assumptions are violated to first and/or second order: E[Y | X1, ..., Xp] is not linear in the predictors and/or V[Y | X1, ..., Xp] is not constant. Such inference is meaningful if the linear model is viewed as a useful approximation rather than as part of a generative truth. A difficulty well known in econometrics is that the standard errors required under random predictors and model violations can differ greatly from the conventional standard errors that are valid when the linear model is correct. The difference stems from a synergistic effect between model violations and the randomness of the predictors. We show that, asymptotically, the ratios between correct and conventional standard errors can range from zero to infinity, and that these ratios vary between predictors within the same multiple regression. This difficulty has a consequence for statistics: it entails the breakdown of the classical ancillarity argument for predictors. When the assumptions of a generative regression model are violated, treating the predictors as fixed is no longer valid, and standard inferences may lose their significance and confidence guarantees. The standard econometric remedy for consistent inference under misspecification and random predictors is the "sandwich estimator" of the covariance matrix of β̂. A plausible alternative is the paired bootstrap, which resamples predictors and response jointly. Discrepancies between conventional and bootstrap standard errors can be used as diagnostics for predictor-specific model violations, in analogy to econometric misspecification tests. The good news is that when model violations are strong enough to invalidate conventional linear inference, their nature tends to be visible in graphical diagnostics.
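Since the abstract names the sandwich estimator and the paired bootstrap as the two consistent alternatives to conventional standard errors, a minimal numerical sketch may help fix ideas. The simulated data, seed, and variable names below are illustrative assumptions, not taken from the paper; the data-generating process is deliberately nonlinear and heteroskedastic so that the linear fit is only an approximation.

```python
# Minimal sketch (not the paper's code): compare conventional OLS standard
# errors with sandwich (HC0) and paired-bootstrap standard errors on
# simulated data with a nonlinear mean and non-constant variance.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: E[Y|X] is quadratic, V[Y|X] depends on X.
n = 1000
x = rng.uniform(-1, 3, size=n)
y = x**2 + (0.5 + np.abs(x)) * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # design matrix with intercept
p = X.shape[1]

# OLS fit
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Conventional standard errors: sigma^2 (X'X)^{-1},
# valid only when the linear model is correct.
sigma2 = resid @ resid / (n - p)
se_conv = np.sqrt(np.diag(sigma2 * XtX_inv))

# Sandwich (HC0) standard errors:
# (X'X)^{-1} [ sum_i r_i^2 x_i x_i' ] (X'X)^{-1}
meat = X.T @ (X * resid[:, None]**2)
se_sand = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# Paired (x-y) bootstrap: resample rows (predictors and response
# jointly) and refit; take the SD of the refitted coefficients.
B = 2000
boot = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
se_boot = boot.std(axis=0, ddof=1)

print("conventional:", se_conv)
print("sandwich    :", se_sand)
print("paired boot :", se_boot)
print("ratio sandwich/conventional:", se_sand / se_conv)
```

Under this setup the sandwich and paired-bootstrap standard errors agree closely with each other and can differ substantially from the conventional ones, with the discrepancy differing between the intercept and the slope, which is the predictor-specific ratio effect the abstract describes.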