Probing What Different NLP Tasks Teach Machines about Function Word Comprehension

We introduce a set of nine challenge tasks that test for the understanding of function words. These tasks are created by structurally mutating sentences from existing datasets to target the comprehension of specific types of function words (e.g., prepositions, wh-words). Using these probing tasks, we explore the effects of various pretraining objectives for sentence encoders (e.g., language modeling, CCG supertagging and natural language inference (NLI)) on the learned representations. Our results show that pretraining on CCG—our most syntactic objective—performs the best on average across our probing tasks, suggesting that syntactic knowledge helps function word comprehension. Language modeling also shows strong performance, supporting its widespread use for pretraining state-of-the-art NLP models. Overall, no pretraining objective dominates across the board, and our function word probing tasks highlight several intuitive differences between pretraining objectives, e.g., that NLI helps the comprehension of negation.

[1]  Sanjiv Kumar,et al.  On the Convergence of Adam and Beyond , 2018 .

[2]  Carolyn Penstein Rosé,et al.  Stress Test Evaluation for Natural Language Inference , 2018, COLING.

[3]  Allan Jabri,et al.  Learning Visually Grounded Sentence Representations , 2018, NAACL.

[4]  Sheng Zhang,et al.  Ordinal Common-sense Inference , 2016, TACL.

[5]  Yonatan Belinkov,et al.  On the Evaluation of Semantic Phenomena in Neural Machine Translation Using Natural Language Inference , 2018, NAACL.

[6]  Gerard de Melo,et al.  Exploring Semantic Properties of Sentence Embeddings , 2018, ACL.

[7]  Alex Wang,et al.  What do you learn from context? Probing for sentence structure in contextualized word representations , 2019, ICLR.

[8]  Anders Søgaard,et al.  Linguistic representations in multi-task neural networks for ellipsis resolution , 2018, BlackboxNLP@EMNLP.

[9]  Omer Levy,et al.  GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.

[10]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[11]  Edouard Grave,et al.  Colorless Green Recurrent Networks Dream Hierarchically , 2018, NAACL.

[12]  Richard Socher,et al.  Pointer Sentinel Mixture Models , 2016, ICLR.

[13]  Carlos Guestrin,et al.  Semantically Equivalent Adversarial Rules for Debugging NLP models , 2018, ACL.

[14]  Richard Socher,et al.  The Natural Language Decathlon: Multitask Learning as Question Answering , 2018, ArXiv.

[15]  Dieuwke Hupkes,et al.  Do Language Models Understand Anything? On the Ability of LSTMs to Understand Negative Polarity Items , 2018, BlackboxNLP@EMNLP.

[16]  Samuel R. Bowman,et al.  Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning , 2017, ArXiv.

[17]  Samuel R. Bowman,et al.  A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , 2017, NAACL.

[18]  Rachel Rudinger,et al.  Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation , 2018, BlackboxNLP@EMNLP.

[19]  Thorsten Brants,et al.  One billion word benchmark for measuring progress in statistical language modeling , 2013, INTERSPEECH.

[20]  Micha Elsner,et al.  Breaking NLP: Using Morphosyntax, Semantics, Pragmatics and World Knowledge to Fool Sentiment Analysis Systems , 2017 .

[21]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[22]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[23]  Richard Socher,et al.  Learned in Translation: Contextualized Word Vectors , 2017, NIPS.

[24]  Nathan Schneider,et al.  Comprehensive Supersense Disambiguation of English Prepositions and Possessives , 2018, ACL.

[25]  Yonatan Belinkov,et al.  Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks , 2016, ICLR.

[26]  Luke S. Zettlemoyer,et al.  AllenNLP: A Deep Semantic Natural Language Processing Platform , 2018, ArXiv.

[27]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[28]  Allyson Ettinger,et al.  Probing for semantic evidence of composition by means of simple classification tasks , 2016, RepEval@ACL.

[29]  Florian Mohnert,et al.  Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information , 2018, BlackboxNLP@EMNLP.

[30]  Mark Steedman,et al.  CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank , 2007, CL.

[31]  Kevin Duh,et al.  Inference is Everything: Recasting Semantic Resources into a Unified Evaluation Framework , 2017, IJCNLP.

[32]  Holger Schwenk,et al.  Supervised Learning of Universal Sentence Representations from Natural Language Inference Data , 2017, EMNLP.

[33]  Srinivas Bangalore,et al.  Supertagging: An Approach to Almost Parsing , 1999, CL.

[34]  Rui Yan,et al.  Natural Language Inference by Tree-Based Convolution and Heuristic Matching , 2015, ACL.

[35]  Noah D. Goodman,et al.  Evaluating Compositionality in Sentence Embeddings , 2018, CogSci.

[36]  Noah D. Goodman,et al.  DisSent: Sentence Representation Learning from Explicit Discourse Relations , 2017, ArXiv.

[37]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[38]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[39]  Hailin Jin,et al.  Rethinking Skip-thought: A Neighborhood based Approach , 2017, Rep4NLP@ACL.

[40]  Guillaume Lample,et al.  What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties , 2018, ACL.

[41]  Samuel R. Bowman,et al.  Neural Network Acceptability Judgments , 2018, Transactions of the Association for Computational Linguistics.

[42]  Ali Farhadi,et al.  Bidirectional Attention Flow for Machine Comprehension , 2016, ICLR.

[43]  Gemma Boleda,et al.  Distributional Semantics in Use , 2015, LSDSem@EMNLP.

[44]  Emily M. Bender,et al.  Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task , 2017, Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems.

[45]  Ieva Staliūnaite,et al.  Breaking Sentiment Analysis of Movie Reviews , 2017 .

[46]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).