Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect

As the class size grows, maintaining a balanced dataset across many classes is challenging because the data are long-tailed in nature; it is even impossible when the sample-of-interest co-exists with each other in one collectable unit, e.g., multiple visual instances in one image. Therefore, long-tailed classification is the key to deep learning at scale. However, existing methods are mainly based on re-weighting/re-sampling heuristics that lack a fundamental theory. In this paper, we establish a causal inference framework, which not only unravels the whys of previous methods, but also derives a new principled solution. Specifically, our theory shows that the SGD momentum is essentially a confounder in long-tailed classification. On one hand, it has a harmful causal effect that misleads the tail prediction biased towards the head. On the other hand, its induced mediation also benefits the representation learning and head prediction. Our framework elegantly disentangles the paradoxical effects of the momentum, by pursuing the direct causal effect caused by an input sample. In particular, we use causal intervention in training, and counterfactual reasoning in inference, to remove the "bad" while keep the "good". We achieve new state-of-the-arts on three long-tailed visual recognition benchmarks: Long-tailed CIFAR-10/-100, ImageNet-LT for image classification and LVIS for instance segmentation.

[1]  Hanwang Zhang,et al.  Interventional Few-Shot Learning , 2020, NeurIPS.

[2]  Jinhui Tang,et al.  Causal Intervention for Weakly-Supervised Semantic Segmentation , 2020, NeurIPS.

[3]  Yu Wang,et al.  Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets , 2020, ECCV.

[4]  Zhiwu Lu,et al.  Counterfactual VQA: A Cause-Effect Look at Language Bias , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Sheng Tang,et al.  Overcoming Classifier Imbalance for Long-Tail Object Detection With Balanced Group Softmax , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Yi Jiang,et al.  Learning to Segment the Tail , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Ming-Hsuan Yang,et al.  Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition From a Domain Adaptation Perspective , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Junjie Yan,et al.  Equalization Loss for Long-Tailed Object Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Jianqiang Huang,et al.  Unbiased Scene Graph Generation From Biased Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Yifan Sun,et al.  Deep Representation Learning on Long-Tailed Data: A Learnable Embedding Augmentation Perspective , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Xiu-Shen Wei,et al.  BBN: Bilateral-Branch Network With Cumulative Learning for Long-Tailed Visual Recognition , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Hanwang Zhang,et al.  Two Causal Principles for Improving Visual Dialog , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Saining Xie,et al.  Decoupling Representation and Classifier for Long-Tailed Recognition , 2019, ICLR.

[14]  Matthew S. Fritz,et al.  Mediation analysis. , 2019, Annual review of psychology.

[15]  Colin Wei,et al.  Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss , 2019, NeurIPS.

[16]  Kai Chen,et al.  MMDetection: Open MMLab Detection Toolbox and Benchmark , 2019, ArXiv.

[17]  Ross B. Girshick,et al.  LVIS: A Dataset for Large Vocabulary Instance Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Stella X. Yu,et al.  Large-Scale Long-Tailed Recognition in an Open World , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Yu-Chiang Frank Wang,et al.  A Closer Look at Few-shot Classification , 2019, ICLR.

[20]  Qi Xie,et al.  Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting , 2019, NeurIPS.

[21]  Wei Wu,et al.  Dynamic Curriculum Learning for Imbalanced Data Classification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Yang Song,et al.  Class-Balanced Loss Based on Effective Number of Samples , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Mélanie Frappier,et al.  The Book of Why: The New Science of Cause and Effect , 2018, Science.

[24]  Kaiming He,et al.  Exploring the Limits of Weakly Supervised Pretraining , 2018, ECCV.

[25]  Nikos Komodakis,et al.  Dynamic Few-Shot Visual Learning Without Forgetting , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Bin Yang,et al.  Learning to Reweight Examples for Robust Deep Learning , 2018, ICML.

[27]  Matthew A. Brown,et al.  Low-Shot Learning with Imprinted Weights , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Nuno Vasconcelos,et al.  Cascade R-CNN: Delving Into High Quality Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Geoffrey E. Hinton,et al.  Dynamic Routing Between Capsules , 2017, NIPS.

[30]  Hongyi Zhang,et al.  mixup: Beyond Empirical Risk Minimization , 2017, ICLR.

[31]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[34]  Serge J. Belongie,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Ramprasaath R. Selvaraju,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, International Journal of Computer Vision.

[37]  Frank Hutter,et al.  SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.

[38]  J. Pearl,et al.  Causal Inference in Statistics: A Primer , 2016 .

[39]  Qingming Huang,et al.  Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks , 2015, ECCV.

[40]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Bennamoun,et al.  Cost-Sensitive Learning of Deep Feature Representations From Imbalanced Data , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[42]  L. Keele The Statistics of Causal Inference: A View from Political Methodology , 2015, Political Analysis.

[43]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[45]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[46]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[47]  Lorenzo Richiardi,et al.  Mediation analysis in epidemiology: methods, interpretation and bias. , 2013, International journal of epidemiology.

[48]  Geoffrey E. Hinton,et al.  On the importance of initialization and momentum in deep learning , 2013, ICML.

[49]  Tyler J VanderWeele,et al.  A three-way decomposition of a total effect into direct, indirect, and interactive effects. , 2013, Epidemiology.

[50]  P. Austin An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies , 2011, Multivariate behavioral research.

[51]  Lars Schmidt-Thieme,et al.  BPR: Bayesian Personalized Ranking from Implicit Feedback , 2009, UAI.

[52]  Robert V. Brill,et al.  Applied Statistics and Probability for Engineers , 2004, Technometrics.

[53]  W. Reed The Pareto, Zipf and other power laws , 2001 .

[54]  Judea Pearl,et al.  Direct and Indirect Effects , 2001, UAI.

[55]  J. Pearl Causal diagrams for empirical research , 1995 .

[56]  Fu Jie Huang,et al.  A Tutorial on Energy-Based Learning , 2006 .

[57]  Robert C. Holte,et al.  C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling , 2003 .

[58]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[59]  Ning Qian,et al.  On the momentum term in gradient descent learning algorithms , 1999, Neural Networks.

[60]  J. Pearl,et al.  Cover Letter: a Cover Letter Must Accompany the Submitted Manuscript Stating That the on the Consistency Rule in Causal Inference , 2022 .