DirectProbe: Studying Representations without Classifiers

Understanding how linguistic structure is encoded in contextualized embeddings could help explain their impressive performance across NLP. Existing approaches for probing them typically train classifiers and use the accuracy, mutual information, or complexity of the classifier as a proxy for the quality of the representation. In this work, we argue that doing so can be unreliable because different representations may need different classifiers. We develop a heuristic, DirectProbe, that directly studies the geometry of a representation by building upon the notion of a version space for a task. Experiments with several linguistic tasks and contextualized embeddings show that, even without training classifiers, DirectProbe can shed light on how an embedding space represents labels and can also anticipate classifier performance for the representation.
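To make the "geometry instead of classifiers" idea concrete, below is a minimal, hypothetical sketch, not the authors' implementation. It approximates the distance between the convex regions spanned by two label groups with a hard-margin linear SVM: if the groups are linearly separable, the margin width (2 / ||w||) is the gap between them, and small or zero gaps suggest the labels are harder to distinguish in that embedding space. The helper names (`hull_distance`, `label_pair_distances`) are ours, and DirectProbe itself reportedly works with finer-grained clusters within each label rather than whole label groups.

```python
# Illustrative sketch only: probe the geometry of an embedding space by
# measuring the gap between the point sets of each pair of labels.
import itertools
import numpy as np
from sklearn.svm import LinearSVC


def hull_distance(a, b):
    """Approximate distance between the convex hulls of point sets a and b.

    Returns 0.0 if no separating hyperplane is found (the groups overlap).
    """
    X = np.vstack([a, b])
    y = np.array([0] * len(a) + [1] * len(b))
    clf = LinearSVC(C=1e6, max_iter=100_000).fit(X, y)  # large C ~= hard margin
    if clf.score(X, y) < 1.0:          # not linearly separable
        return 0.0
    return 2.0 / np.linalg.norm(clf.coef_)  # margin width between the hulls


def label_pair_distances(X, labels):
    """Map each pair of labels to the (approximate) gap between its point sets."""
    labels = np.asarray(labels)
    return {
        (u, v): hull_distance(X[labels == u], X[labels == v])
        for u, v in itertools.combinations(np.unique(labels), 2)
    }
```

A caller would pass the contextualized embeddings of labeled tokens as `X` and their gold labels as `labels`; label pairs whose gap is zero or near zero are the ones where a downstream classifier would be expected to struggle, without ever training that classifier on the task itself.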
