Information-Theoretic Probing for Linguistic Structure

The success of neural networks on a diverse set of NLP tasks has led researchers to question how much these networks actually “know” about natural language. Probes are a natural way of assessing this. When probing, a researcher chooses a linguistic task and trains a supervised model to predict annotations for that task from the network’s learned representations. If the probe does well, the researcher may conclude that the representations encode knowledge related to the task. A commonly held belief is that simpler probes are better; the logic is that a simpler model will identify linguistic structure without learning the task itself. We propose an information-theoretic operationalization of probing as mutual-information estimation that contradicts this received wisdom: one should always select the highest-performing probe one can, even if it is more complex, since it yields a tighter estimate and thus reveals more of the linguistic information inherent in the representation. The experimental portion of our paper focuses on empirically estimating the mutual information between a linguistic property and BERT, comparing these estimates to several baselines. We evaluate on ten typologically diverse languages that are often underrepresented in NLP research, plus English, for a total of eleven languages. Our implementation is available at https://github.com/rycolab/info-theoretic-probing.
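To make the central claim concrete, the following minimal Python sketch (not the authors' released implementation; it uses synthetic data and scikit-learn's LogisticRegression as a stand-in probe) illustrates how a probe's held-out cross-entropy yields a lower bound on the mutual information I(R; T) between representations R and labels T: the cross-entropy upper-bounds H(T | R), so H(T) minus the probe's cross-entropy lower-bounds I(R; T), and a better-performing probe tightens that bound.

```python
# Minimal sketch (not the paper's code): lower-bounding I(R; T) with a probe.
# In the paper, R would be BERT representations and T linguistic annotations;
# here both are synthetic, and the probe is a simple logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# Toy data: 2000 "representations" in R^16 with 5 labels loosely encoded in them.
n, d, k = 2000, 16, 5
labels = rng.integers(0, k, size=n)
reps = rng.normal(size=(n, d))
reps[:, 0] += labels  # inject some label information into the representations

train, test = slice(0, 1500), slice(1500, n)

# Plug-in estimate of the label entropy H(T), in nats.
p = np.bincount(labels[train], minlength=k) / len(labels[train])
h_t = -np.sum(p[p > 0] * np.log(p[p > 0]))

# The probe's held-out cross-entropy upper-bounds H(T | R),
# so H(T) - CE is a lower bound on I(R; T); a better probe tightens it.
probe = LogisticRegression(max_iter=1000).fit(reps[train], labels[train])
ce = log_loss(labels[test], probe.predict_proba(reps[test]), labels=np.arange(k))

mi_lower_bound = h_t - ce
print(f"H(T) = {h_t:.3f} nats, probe CE = {ce:.3f} nats, I(R;T) >= {mi_lower_bound:.3f} nats")
```

Under this view, swapping the linear probe for a more expressive model can only raise the reported lower bound (assuming it does not overfit the held-out cross-entropy), which is why the paper argues for using the best probe available rather than the simplest one.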
