The EOS Decision and Length Extrapolation

Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of an often-overlooked modeling decision: predicting the end of the generative process with a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to do so (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen during training on a bracket-closing task, and achieving a 40% improvement over +EOS on the difficult length-generalization task of the SCAN dataset. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position in a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.
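The two ingredients of the oracle setting can be made concrete with a minimal sketch, assuming a seq2seq decoder exposed as `model(src, prefix)` that returns next-token logits over the vocabulary; the helper names `oracle_length_decode` and `strip_eos_from_targets` are hypothetical illustrations, not code from the paper.

```python
import torch

def oracle_length_decode(model, src, target_len, eos_id, bos_id=0):
    """Greedy decoding forced to the oracle length: the EOS logit is masked
    at every step, so generation stops only after `target_len` tokens."""
    out = [bos_id]
    for _ in range(target_len):
        logits = model(src, torch.tensor(out))  # next-token scores over the vocabulary
        logits = logits.clone()
        logits[eos_id] = float("-inf")          # EOS is never selected; the oracle decides when to stop
        out.append(int(logits.argmax()))
    return out[1:]                              # drop the BOS token

def strip_eos_from_targets(target_ids, eos_id):
    """-EOS training: remove the end-of-sequence token from the targets,
    so the model is never trained to predict when generation should end."""
    return [t for t in target_ids if t != eos_id]
```

In this sketch, +EOS models are trained on targets that retain `eos_id` while -EOS models use `strip_eos_from_targets`; at evaluation time both are decoded with `oracle_length_decode`, so any difference reflects the training decision rather than stopping behavior.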
