The EOS Decision and Length Extrapolation

Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of an often-overlooked modeling decision: predicting the end of the generative process with a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to do so (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen during training on a bracket-closing task, and achieving a 40% improvement over +EOS on the difficult length-generalization task of the SCAN dataset. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position in a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.
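The two ingredients of the oracle setting can be made concrete with a minimal sketch, assuming a seq2seq decoder exposed as `model(src, prefix)` that returns next-token logits over the vocabulary; the helper names `oracle_length_decode` and `strip_eos_from_targets` are hypothetical illustrations, not code from the paper.

```python
import torch

def oracle_length_decode(model, src, target_len, eos_id, bos_id=0):
    """Greedy decoding forced to the oracle length: the EOS logit is masked
    at every step, so generation stops only after `target_len` tokens."""
    out = [bos_id]
    for _ in range(target_len):
        logits = model(src, torch.tensor(out))  # next-token scores over the vocabulary
        logits = logits.clone()
        logits[eos_id] = float("-inf")          # EOS is never selected; the oracle decides when to stop
        out.append(int(logits.argmax()))
    return out[1:]                              # drop the BOS token

def strip_eos_from_targets(target_ids, eos_id):
    """-EOS training: remove the end-of-sequence token from the targets,
    so the model is never trained to predict when generation should end."""
    return [t for t in target_ids if t != eos_id]
```

In this sketch, +EOS models are trained on targets that retain `eos_id` while -EOS models use `strip_eos_from_targets`; at evaluation time both are decoded with `oracle_length_decode`, so any difference reflects the training decision rather than stopping behavior.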
