IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization

Fine-tuning pre-trained language models (PTLMs), such as BERT and its improved variant RoBERTa, has become common practice for advancing performance on natural language understanding (NLU) tasks. Recent advances in representation learning show that isotropic (i.e., unit-variance and uncorrelated) embeddings can significantly improve downstream performance, with faster convergence and better generalization. The isotropy of the pre-trained embeddings in PTLMs, however, remains relatively under-explored. In this paper, we analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with straightforward visualization and point out two major issues: high variance in their standard deviations and high correlation between different dimensions. We also propose a new network regularization method, isotropic batch normalization (IsoBN), to address these issues and encourage more isotropic representations during fine-tuning. This simple yet effective fine-tuning method yields an absolute improvement of about 1.0 point on the average score of seven benchmark NLU tasks.
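The two issues named above can be inspected directly from a batch of [CLS] embeddings. Below is a minimal sketch (not the authors' released code) of such a diagnostic: it computes the per-dimension standard deviations, whose spread reflects the first issue, and the average absolute off-diagonal correlation, which reflects the second. The tensor name `cls_embeddings` and its shape `(batch_size, hidden_dim)` are assumptions for illustration.

```python
import torch

def isotropy_diagnostics(cls_embeddings: torch.Tensor):
    """Rough diagnostics for the isotropy of a batch of [CLS] embeddings."""
    # Per-dimension standard deviation; an isotropic representation would have
    # roughly equal (unit) values across all dimensions.
    dim_std = cls_embeddings.std(dim=0)

    # Pairwise correlation between dimensions; isotropy implies near-zero
    # off-diagonal entries.
    centered = cls_embeddings - cls_embeddings.mean(dim=0, keepdim=True)
    cov = centered.t() @ centered / (cls_embeddings.size(0) - 1)
    corr = cov / (dim_std.unsqueeze(0) * dim_std.unsqueeze(1) + 1e-8)
    off_diag = corr - torch.diag(torch.diag(corr))
    avg_abs_corr = off_diag.abs().sum() / (corr.numel() - corr.size(0))

    return dim_std, avg_abs_corr
```

In this sketch, a large spread in `dim_std` and a large `avg_abs_corr` would correspond to the high-variance and high-correlation problems the paper identifies; IsoBN is proposed as a normalization step that pushes both toward isotropy during fine-tuning.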
