TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference
Deming Ye | Yankai Lin | Yufei Huang | Maosong Sun