Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization

A common explanation for the failure of deep networks to generalize out-of-distribution is that they fail to recover the "correct" features. We challenge this notion with a simple experiment which suggests that ERM already learns sufficient features and that the current bottleneck is not feature learning but robust regression. Our findings also imply that, given a small amount of data from the target distribution, retraining only the last linear layer on top of the frozen features yields excellent performance. We therefore argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. Towards this end, we introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Rather than learning one function, DARE performs a domain-specific adjustment to unify the domains in a canonical latent space and learns to predict in this space. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions. Further, we provide the first finite-environment convergence guarantee to the minimax risk, improving over existing analyses which yield minimax predictors only after an environment threshold. Evaluated on finetuned features, we find that DARE compares favorably to prior methods, consistently achieving equal or better performance.
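To make the last-layer claim concrete, the sketch below retrains only a linear head on features from a frozen, ERM-pretrained backbone. This is a minimal illustration rather than the paper's exact protocol: the ResNet-50 backbone, the random placeholder tensors standing in for a small labeled target-domain sample, and the scikit-learn logistic head are all assumptions made for the sketch.

```python
import torch
import torchvision
from torch.utils.data import DataLoader, TensorDataset
from sklearn.linear_model import LogisticRegression

# Frozen feature extractor: an ERM-pretrained backbone with its head removed.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()  # expose the 2048-d penultimate features
backbone.eval()

@torch.no_grad()
def extract_features(loader):
    """Run the frozen backbone over (image, label) batches."""
    feats, labels = zip(*[(backbone(x), y) for x, y in loader])
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Placeholder for a *small* labeled sample from the target domain;
# substitute a real target-domain dataset here.
target = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 2, (64,)))
X_t, y_t = extract_features(DataLoader(target, batch_size=16))

# "Last-layer retraining": fit a fresh linear head on the frozen features,
# leaving every other parameter of the network untouched.
head = LogisticRegression(max_iter=1000).fit(X_t, y_t)
```

Because the backbone never changes, the entire procedure reduces to a convex fit over precomputed features, which is cheap even when labeled target data is scarce.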

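The precise DARE objective is given in the paper; the sketch below is one minimal reading of the "domain-specific adjustment," assumed here to be per-domain feature whitening (centering each domain's features and multiplying by the inverse square root of that domain's covariance), followed by pooled least squares in the unified latent space. The synthetic data-generating model and the helper names (`whiten`, `make_domain`) are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

rng = np.random.default_rng(0)
d, n = 5, 1000
beta_true = rng.normal(size=d)

def whiten(X, eps=1e-6):
    """Center X and map it through the inverse square root of its covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc) + eps * np.eye(X.shape[1])  # ridge for stability
    return Xc @ fractional_matrix_power(cov, -0.5)

def make_domain(seed):
    """Each domain applies its own symmetric mixing M_e to a shared latent Z."""
    r = np.random.default_rng(seed)
    B = r.normal(size=(d, d))
    M = B @ B.T + np.eye(d)          # symmetric positive-definite mixing
    Z = r.normal(size=(n, d))        # canonical latent space, shared by all domains
    X = Z @ M                        # observed, domain-shifted features
    y = Z @ beta_true + 0.1 * r.normal(size=n)
    return X, y

# Train: whiten each domain separately, then pool and solve least squares.
train = [make_domain(s) for s in (1, 2, 3)]
Xw = np.vstack([whiten(X) for X, _ in train])
yw = np.concatenate([y for _, y in train])
beta_hat = np.linalg.lstsq(Xw, yw, rcond=None)[0]

# Test: the same adjustment needs only *unlabeled* target features.
X_te, y_te = make_domain(99)
mse = np.mean((whiten(X_te) @ beta_hat - y_te) ** 2)
print("test MSE after domain adjustment:", round(mse, 3))
```

Since the adjustment at test time depends only on the target feature covariance, it can be estimated from unlabeled target data, which is what makes prediction in the unified latent space feasible under distribution shift.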