Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings.

[1]  Jason Weston,et al.  Jump to better conclusions: SCAN both left and right , 2018, BlackboxNLP@EMNLP.

[2]  Diego Calvanese,et al.  The Description Logic Handbook: Theory, Implementation, and Applications , 2003, Description Logic Handbook.

[3]  Jonathan Berant,et al.  Building a Semantic Parser Overnight , 2015, ACL.

[4]  Christopher D. Manning,et al.  Learning by Abstraction: The Neural State Machine , 2019, NeurIPS.

[5]  Brian M. Sadler,et al.  On Generating Characteristic-rich Question Sets for QA Evaluation , 2016, EMNLP.

[6]  Pushmeet Kohli,et al.  Analysing Mathematical Reasoning Abilities of Neural Models , 2019, ICLR.

[7]  N. Curteanu Book Reviews: Lecture on Contemporary Syntactic Theories: An Introduction to Unification-Based Approaches to Grammar , 1987, CL.

[8]  Andrew Chou,et al.  Semantic Parsing on Freebase from Question-Answer Pairs , 2013, EMNLP.

[9]  Jason Weston,et al.  Key-Value Memory Networks for Directly Reading Documents , 2016, EMNLP.

[10]  Murray Shanahan,et al.  SCAN: Learning Hierarchical Compositional Visual Concepts , 2017, ICLR.

[11]  Ming-Wei Chang,et al.  The Value of Semantic Parse Labeling for Knowledge Base Question Answering , 2016, ACL.

[12]  Praveen Paritosh,et al.  Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.

[13]  J. Fodor,et al.  Connectionism and cognitive architecture: A critical analysis , 1988, Cognition.

[14]  Lukasz Kaiser,et al.  Universal Transformers , 2018, ICLR.

[15]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Marco Baroni,et al.  Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks , 2018, BlackboxNLP@EMNLP.

[17]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[18]  Christopher D. Manning,et al.  GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Michael A. Covington,et al.  GULP 3.1: An extension of Prolog for unification-based grammar , 1994 .

[20]  Christopher D. Manning,et al.  Compositional Attention Networks for Machine Reasoning , 2018, ICLR.

[21]  Samy Bengio,et al.  Tensor2Tensor for Neural Machine Translation , 2018, AMTA.

[22]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[23]  Jonathan Berant,et al.  The Web as a Knowledge-Base for Answering Complex Questions , 2018, NAACL.

[24]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[25]  Dragomir R. Radev,et al.  Improving Text-to-SQL Evaluation Methodology , 2018, ACL.

[26]  Yoshua Bengio,et al.  Compositional generalization in a deep seq2seq model by separating syntax and semantics , 2019, ArXiv.

[27]  Thomas L. Griffiths,et al.  Automatically Composing Representation Transformations as a Means for Generalization , 2018, ICLR.

[28]  Jacob Andreas,et al.  Measuring Compositionality in Representation Learning , 2019, ICLR.

[29]  Marco Baroni,et al.  Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks , 2017, ICML.

[30]  C. T. Ng,et al.  Measures of distance between probability distributions , 1989 .

[31]  Lauri Karttunen,et al.  Features and Values , 1984, ACL.

[32]  Xuchen Yao,et al.  Information Extraction over Structured Data: Question Answering with Freebase , 2014, ACL.

[33]  Razvan Pascanu,et al.  Relational inductive biases, deep learning, and graph networks , 2018, ArXiv.

[34]  Dhruv Batra,et al.  Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Franz Baader,et al.  Pushing the EL Envelope , 2005, IJCAI.

[36]  Chuang Gan,et al.  The Neuro-Symbolic Concept Learner: Interpreting Scenes Words and Sentences from Natural Supervision , 2019, ICLR.

[37]  Le Song,et al.  Variational Reasoning for Question Answering with Knowledge Graph , 2017, AAAI.

[38]  Noam Chomsky,et al.  वाक्यविन्यास का सैद्धान्तिक पक्ष = Aspects of the theory of syntax , 1965 .

[39]  Jason Weston,et al.  Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks , 2015, ICLR.