Rethinking Importance Weighting for Transfer Learning

A key assumption in supervised learning is that training and test data follow the same probability distribution. However, this fundamental assumption is not always satisfied in practice, e.g., due to changing environments, sample selection bias, privacy concerns, or high labeling costs. Transfer learning (TL) relaxes this assumption and allows us to learn under distribution shift. Classical TL methods typically rely on importance weighting: a predictor is trained on training losses weighted according to the importance, i.e., the test-over-training density ratio. However, as real-world machine learning tasks become increasingly complex, high-dimensional, and dynamic, novel approaches have recently been explored to cope with such challenges. In this article, after introducing the foundations of TL based on importance weighting, we review recent advances based on joint and dynamic importance-predictor estimation. Furthermore, we introduce a method of causal mechanism transfer that incorporates causal structure into TL. Finally, we discuss future perspectives of TL research.
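
To make the importance-weighting idea concrete, here is a minimal sketch (not the article's code) of an importance-weighted empirical risk under covariate shift. For illustration it assumes a known Gaussian shift of the input distribution so the density ratio w(x) = p_test(x) / p_train(x) has a closed form; in practice the ratio would be estimated directly from samples, e.g., by KLIEP or least-squares importance fitting. The function names and the toy regression setup are hypothetical.

```python
import numpy as np

def importance_weighted_risk(losses, weights):
    """Importance-weighted empirical risk: per-sample training losses
    scaled by the test-over-training density ratio, then averaged."""
    return np.mean(weights * losses)

# Hypothetical toy setup: 1-D regression data drawn from the training
# distribution p_train = N(0, 1), while test inputs follow N(0.5, 1).
rng = np.random.default_rng(0)
x_train = rng.normal(loc=0.0, scale=1.0, size=100)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=100)

def density_ratio(x, mu_tr=0.0, mu_te=0.5, sigma=1.0):
    # Closed-form ratio of two Gaussian densities with equal variance:
    # w(x) = N(x; mu_te, sigma) / N(x; mu_tr, sigma).
    log_r = ((x - mu_tr) ** 2 - (x - mu_te) ** 2) / (2.0 * sigma ** 2)
    return np.exp(log_r)

weights = density_ratio(x_train)
preds = 0.8 * np.sin(x_train)          # some fixed candidate predictor
losses = (preds - y_train) ** 2        # squared loss on training points

# Unbiased estimate of the test risk, computed from training data only.
print(importance_weighted_risk(losses, weights))
```

Reweighting makes the weighted average of training losses an unbiased estimator of the test risk, which is why classical covariate shift adaptation reduces to estimating the density ratio well.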
