Multi-scale Transformer Language Models

We investigate multi-scale transformer language models that learn representations of text at multiple scales, and present three architectures with an inductive bias toward the hierarchical nature of language. Experiments on large-scale language modeling benchmarks empirically demonstrate favorable likelihood vs. memory-footprint trade-offs: for example, on the Toronto BookCorpus, a 30-layer hierarchical variant achieves a 23% smaller memory footprint and better perplexity than a vanilla transformer with fewer than half as many layers. We analyze the advantages of learned multi-scale representations in terms of memory footprint, compute time, and perplexity, which are particularly appealing given that a transformer's run time and memory usage scale quadratically with sequence length.
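To make the multi-scale idea concrete, below is a minimal, hypothetical PyTorch sketch (not any of the paper's three architectures, whose details are not given here) of a block that pools token representations to a coarser resolution, runs self-attention at that scale, and upsamples back into the fine-grained stream. Attention over the pooled sequence costs O((n/k)^2) rather than O(n^2), which is the source of the memory savings the abstract alludes to. The class name, pooling choice, and fusion-by-residual are illustrative assumptions; a real autoregressive variant would also need careful causal masking so that coarse tokens do not leak future context.

import torch
import torch.nn as nn


class CoarseScaleBlock(nn.Module):
    """Illustrative block: attend over a pooled (coarser) copy of the sequence."""

    def __init__(self, d_model=512, n_heads=8, pool_size=4):
        super().__init__()
        self.pool_size = pool_size
        # Average-pool groups of `pool_size` tokens into one coarse token.
        self.downsample = nn.AvgPool1d(kernel_size=pool_size, stride=pool_size)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); seq_len assumed divisible by pool_size.
        coarse = self.downsample(x.transpose(1, 2)).transpose(1, 2)
        # Self-attention at the coarse scale: quadratic in seq_len / pool_size.
        coarse, _ = self.attn(coarse, coarse, coarse, need_weights=False)
        # Nearest-neighbour upsample back to the original resolution and
        # fuse with the fine-scale stream through a residual connection.
        upsampled = coarse.repeat_interleave(self.pool_size, dim=1)
        return self.norm(x + upsampled)


if __name__ == "__main__":
    block = CoarseScaleBlock()
    tokens = torch.randn(2, 128, 512)   # batch of 2 sequences of 128 tokens
    print(block(tokens).shape)          # torch.Size([2, 128, 512])

Stacking such blocks, or interleaving them with standard transformer layers, is one plausible way to trade fine-grained attention for a smaller activation footprint at a given depth.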
