What Happens To BERT Embeddings During Fine-tuning?

While there has been much recent work studying how linguistic information is encoded in pre-trained sentence representations, comparatively little is understood about how these models change when adapted to solve downstream tasks. Using a suite of analysis techniques (probing classifiers, Representational Similarity Analysis, and model ablations), we investigate how fine-tuning affects the representations of the BERT model. We find that while fine-tuning necessarily makes significant changes, it does not lead to catastrophic forgetting of linguistic phenomena. We instead find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks. In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing. Finally, we also find that fine-tuning has a weaker effect on representations of out-of-domain sentences, suggesting room for improvement in model generalization.
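To make the analysis setup concrete, below is a minimal sketch of one of the techniques the abstract names, Representational Similarity Analysis (RSA), applied to compare a pre-trained and a fine-tuned BERT at a single layer. This is not the paper's exact protocol: the sentence list, the pooling choice, the layer index, and the fine-tuned checkpoint path ("./bert-finetuned-mnli") are illustrative assumptions; only the Hugging Face and SciPy calls are standard.

```python
# Sketch: RSA between a pre-trained and a fine-tuned BERT at one layer.
# The fine-tuned checkpoint path is a placeholder; substitute your own model.
import numpy as np
import torch
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

sentences = [
    "The cat sat on the mat.",
    "A dog chased the ball.",
    "Stock prices fell sharply on Monday.",
    "She finished reading the novel last night.",
]

def layer_representations(model_name, layer, sentences):
    """Mean-pooled token states from one encoder layer, one row per sentence."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    reps = []
    with torch.no_grad():
        for s in sentences:
            inputs = tokenizer(s, return_tensors="pt")
            hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, dim)
            reps.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(reps)

def rsa_similarity(reps_a, reps_b):
    """Spearman correlation between the two models' pairwise cosine-distance matrices."""
    rdm_a = pdist(reps_a, metric="cosine")  # condensed upper triangle
    rdm_b = pdist(reps_b, metric="cosine")
    return spearmanr(rdm_a, rdm_b).correlation

layer = 11  # a top layer, where fine-tuning effects are reported to be largest
pre = layer_representations("bert-base-uncased", layer, sentences)
tuned = layer_representations("./bert-finetuned-mnli", layer, sentences)  # placeholder
print(f"RSA similarity at layer {layer}: {rsa_similarity(pre, tuned):.3f}")
```

Running the same comparison across all layers (and across in-domain versus out-of-domain sentences) gives the kind of layer-wise and domain-sensitive picture the abstract summarizes: lower layers staying close to the pre-trained model, top layers diverging most.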
