The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

We study linear regression under covariate shift, where the marginal distribution of the input covariates differs between the source and target domains, while the conditional distribution of the output given the covariates is similar across the two domains. We investigate a transfer learning approach for this problem: pretraining on the source data followed by finetuning on the target data, both performed by online SGD. We establish sharp, instance-dependent upper and lower bounds on the excess risk of this approach. Our bounds suggest that for a large class of linear regression instances, transfer learning with O(N^2) source data (and scarce or no target data) is as effective as supervised learning with N target data. In addition, we show that finetuning, even with only a small amount of target data, can drastically reduce the amount of source data required by pretraining. Our theory sheds light on the effectiveness and limitations of pretraining, as well as the benefits of finetuning, for tackling covariate shift problems.
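To make the setup concrete, here is a minimal sketch (not the paper's code) of the pipeline the abstract describes: one-pass online SGD on source data for pretraining, followed by one-pass online SGD on scarce target data for finetuning, in a linear regression instance whose covariate covariance differs between the two domains while the regression function is shared. The dimension, covariance spectra, sample sizes, and step sizes below are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Illustrative problem instance (all values are assumptions) ---
d = 20                                        # dimension
beta_star = rng.normal(size=d) / np.sqrt(d)   # shared ground-truth regressor
sigma_noise = 0.1                             # label noise std (same in both domains)

# Covariate shift: source and target share beta_star but have different
# covariate covariances (here, a decaying spectrum vs. its reversal).
cov_source = np.diag(1.0 / np.arange(1, d + 1))
cov_target = np.diag(1.0 / np.arange(1, d + 1)[::-1])

def sample(n, cov):
    """Draw n (x, y) pairs with x ~ N(0, cov) and y = <beta_star, x> + noise."""
    x = rng.multivariate_normal(np.zeros(d), cov, size=n)
    y = x @ beta_star + sigma_noise * rng.normal(size=n)
    return x, y

def online_sgd(w, xs, ys, lr):
    """One pass of single-sample SGD on the squared loss, starting from w."""
    for x, y in zip(xs, ys):
        w = w - lr * (x @ w - y) * x
    return w

def excess_risk(w, cov):
    """Excess risk under the given covariate covariance: (w - beta_star)^T cov (w - beta_star)."""
    e = w - beta_star
    return float(e @ cov @ e)

# Pretraining: plentiful source data, online SGD from zero initialization.
x_src, y_src = sample(10_000, cov_source)
w_pre = online_sgd(np.zeros(d), x_src, y_src, lr=0.05)

# Finetuning: a small amount of target data, online SGD from the pretrained weights.
x_tgt, y_tgt = sample(100, cov_target)
w_fine = online_sgd(w_pre, x_tgt, y_tgt, lr=0.05)

print("target excess risk, pretraining only:   ", excess_risk(w_pre, cov_target))
print("target excess risk, pretrain + finetune:", excess_risk(w_fine, cov_target))
```

Varying the source and target sample sizes in this sketch gives a rough empirical view of the trade-off the abstract refers to: how much source data pretraining alone needs on the target distribution, and how much of that requirement a small finetuning pass can remove.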
