R-Drop: Regularized Dropout for Neural Networks

Dropout is a powerful and widely used technique for regularizing the training of deep neural networks. Although effective, the randomness introduced by dropout causes a non-negligible inconsistency between training and inference. In this paper, we introduce a simple consistency training strategy to regularize dropout, namely R-Drop, which forces the output distributions of different sub-models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub-models sampled by dropout. Theoretical analysis reveals that R-Drop reduces this inconsistency. Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performance with the vanilla Transformer model on WMT14 English→German translation (30.91 BLEU) and WMT14 English→French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced Transformer variants. Our code is available on GitHub.
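To make the training objective concrete, the following is a minimal PyTorch-style sketch of the loss described above for a classification setting: two stochastic forward passes through the same model sample two dropout sub-models, and a symmetric KL term is added to the usual cross-entropy. The function name r_drop_loss, the placeholder model, and the weight alpha are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=1.0):
    """Sketch of an R-Drop-style objective (assumed interface, not the official code).

    model : any nn.Module with dropout, in training mode
    x, y  : a batch of inputs and class-index labels
    alpha : weight on the consistency (bidirectional KL) term
    """
    # Two forward passes on the same input; because dropout is active,
    # each pass corresponds to a different randomly sampled sub-model.
    logits1 = model(x)
    logits2 = model(x)

    # Standard cross-entropy on both passes.
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)

    # Bidirectional KL-divergence between the two output distributions.
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = F.kl_div(logp1, logp2, log_target=True, reduction="batchmean") \
       + F.kl_div(logp2, logp1, log_target=True, reduction="batchmean")

    return ce + alpha * 0.5 * kl
```

In practice, a common implementation trick is to concatenate the batch with itself and run a single forward pass, which yields the two dropout-perturbed outputs at once; the loss computation is otherwise the same.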
