A Primer in BERTology: What We Know About How BERT Works

Abstract

Transformer-based models have pushed the state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression. We then outline directions for future research.
