Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN

Despite their failure to solve the compositional SCAN dataset, seq2seq architectures still achieve astonishing success on more practical tasks. This observation pushes us to question the usefulness of SCAN-style compositional generalization in realistic NLP tasks. In this work, we study the benefit that such compositionality brings to several machine translation tasks. We present several focused modifications of the Transformer that greatly improve generalization capabilities on SCAN and select one that remains on par with a vanilla Transformer on a standard machine translation (MT) task. Next, we study its performance in low-resource settings and on a newly introduced distribution-shifted English-French translation task. Overall, we find that improvements of a SCAN-capable model do not directly transfer to the resource-rich MT setup. In contrast, in the low-resource setup, general modifications lead to an improvement of up to 13.1% in BLEU score relative to a vanilla Transformer. Similarly, they improve an accuracy-based metric by 14% on the newly introduced compositional English-French translation task. This provides experimental evidence that the compositional generalization assessed by SCAN is particularly useful in resource-starved and domain-shifted scenarios.
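The abstract does not spell out which architectural changes were explored, so the sketch below is only an illustrative, hypothetical example of a "focused modification" of the kind often credited with better SCAN generalization: single-head self-attention with learned relative position embeddings, in the spirit of Shaw et al. (2018). The class name, the clipping window max_rel_dist, and the single-head formulation are assumptions for illustration, not necessarily the modification selected in this work.

```python
import math
import torch
import torch.nn as nn


class RelativePositionSelfAttention(nn.Module):
    """Single-head self-attention with learned relative position embeddings
    added to the keys (a sketch in the spirit of Shaw et al., 2018)."""

    def __init__(self, d_model: int, max_rel_dist: int = 16):
        super().__init__()
        self.d_model = d_model
        self.max_rel_dist = max_rel_dist
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One embedding per clipped relative distance in [-max_rel_dist, max_rel_dist].
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Relative distances j - i, clipped to the supported window.
        pos = torch.arange(seq_len, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        rel_k = self.rel_emb(rel + self.max_rel_dist)  # (seq_len, seq_len, d_model)

        # Content-content plus content-position attention logits.
        scores = q @ k.transpose(-2, -1)               # (batch, seq_len, seq_len)
        scores = scores + torch.einsum("bid,ijd->bij", q, rel_k)
        attn = torch.softmax(scores / math.sqrt(self.d_model), dim=-1)
        return attn @ v


if __name__ == "__main__":
    layer = RelativePositionSelfAttention(d_model=32)
    out = layer(torch.randn(2, 10, 32))
    print(out.shape)  # torch.Size([2, 10, 32])
```

In a full model, a layer like this would replace the standard absolute-position self-attention inside each Transformer block; whether that is the variant the authors ultimately selected is not stated in the abstract.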
