Rethinking Importance Weighting for Deep Learning under Distribution Shift

Under distribution shift (DS), where the training data distribution differs from the test one, a powerful technique is importance weighting (IW), which handles DS in two separate steps: weight estimation (WE) estimates the test-over-training density ratio, and weighted classification (WC) trains the classifier from weighted training data. However, IW cannot work well on complex data, since WE is incompatible with deep learning. In this paper, we rethink IW and theoretically show that it suffers from a circular dependency: we need not only WE for WC but also WC for WE, since a trained deep classifier is used as the feature extractor (FE) for WE. To cut off the dependency, we try to pretrain the FE from unweighted training data, which, however, leads to a biased FE. To overcome the bias, we propose an end-to-end solution, dynamic IW, which iterates between WE and WC and combines them in a seamless manner; hence, our WE can also enjoy deep networks and stochastic optimizers indirectly. Experiments with two representative types of DS on Fashion-MNIST and CIFAR-10/100 demonstrate that dynamic IW compares favorably with state-of-the-art methods.
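The abstract describes dynamic IW only at a high level; the PyTorch sketch below illustrates the alternation it refers to. It is a minimal sketch, not the authors' implementation: the network SimpleNet, the helper estimate_weights (a kernel-mean-matching-style estimator applied to the current deep features), and all hyperparameters are assumptions made for this example.

# Minimal sketch of a dynamic IW iteration (illustrative; not the paper's exact method).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    """Deep classifier whose penultimate layer serves as the feature extractor (FE)."""
    def __init__(self, in_dim=784, n_classes=10, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def features(self, x):
        return self.body(x)

    def forward(self, x):
        return self.head(self.body(x))

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian kernel between two batches of feature vectors.
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))

def estimate_weights(f_tr, f_te, reg=1e-3):
    """Weight estimation (WE): regularized kernel-mean-matching-style estimator that
    matches the mean embedding of weighted training features to that of test features
    (a stand-in for the paper's estimator)."""
    n_tr = f_tr.size(0)
    K = gaussian_kernel(f_tr, f_tr)              # (n_tr, n_tr)
    k = gaussian_kernel(f_tr, f_te).mean(dim=1)  # (n_tr,)
    w = torch.linalg.solve(K / n_tr + reg * torch.eye(n_tr), k)
    w = torch.clamp(w, min=0.0)
    return w * n_tr / (w.sum() + 1e-12)          # normalize so weights average to 1

def dynamic_iw_step(model, optimizer, x_tr, y_tr, x_te):
    """One dynamic IW iteration on a minibatch: WE with the current FE,
    then weighted classification (WC) with the freshly estimated weights."""
    with torch.no_grad():                        # WE treats the current FE as fixed
        w = estimate_weights(model.features(x_tr), model.features(x_te))
    losses = F.cross_entropy(model(x_tr), y_tr, reduction="none")
    loss = (w * losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Repeating dynamic_iw_step over minibatches alternates WE and WC: as the classifier (and hence the FE) improves, the importance weights are re-estimated on the fly. The kernel width, the regularization constant, and the choice of estimator are illustrative assumptions rather than the paper's settings.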
