Hierarchical Transformers Are More Efficient Language Models

Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences, which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best-performing upsampling and downsampling layers to create Hourglass, a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently. In particular, Hourglass sets a new state of the art for Transformer models on the ImageNet32 generation task and improves language modeling efficiency on the widely studied enwik8 benchmark.
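
The following is a minimal PyTorch sketch of the hourglass shape described above: a few full-resolution layers, a downsampled middle stack where most of the computation happens, and full-resolution layers after upsampling, with a residual connection from the pre-pooling activations. It assumes average-pooling downsampling and repeat ("nearest-neighbor") upsampling; the class and parameter names (HourglassLM, shorten_factor, etc.) are illustrative rather than taken from the paper's code, and the sketch omits the causal attention masking and token shifting that an actual autoregressive model requires.

```python
# Illustrative hourglass-shaped Transformer stack (not the paper's exact layers).
import torch
import torch.nn as nn


class HourglassLM(nn.Module):
    def __init__(self, vocab_size=256, d_model=512, n_heads=8,
                 pre_layers=2, shortened_layers=6, post_layers=2,
                 shorten_factor=3):
        super().__init__()
        self.k = shorten_factor
        self.embed = nn.Embedding(vocab_size, d_model)

        def stack(n_layers):
            layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                               batch_first=True, norm_first=True)
            return nn.TransformerEncoder(layer, n_layers)

        self.pre = stack(pre_layers)              # full-resolution layers before shortening
        self.shortened = stack(shortened_layers)  # bulk of the layers, run on the short sequence
        self.post = stack(post_layers)            # full-resolution layers after upsampling
        self.to_logits = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) with seq_len divisible by the shorten factor k.
        x = self.pre(self.embed(tokens))          # (B, L, D)
        skip = x
        # Downsample: average every k consecutive activations -> (B, L // k, D).
        short = x.reshape(x.size(0), -1, self.k, x.size(-1)).mean(dim=2)
        short = self.shortened(short)
        # Upsample: repeat each shortened vector k times, add the residual skip.
        up = short.repeat_interleave(self.k, dim=1) + skip
        return self.to_logits(self.post(up))      # (B, L, vocab_size)


if __name__ == "__main__":
    model = HourglassLM()
    logits = model(torch.randint(0, 256, (2, 96)))  # 96 tokens, divisible by k=3
    print(logits.shape)                              # torch.Size([2, 96, 256])
```

Because most layers operate on a sequence shortened by the factor k, their attention and feed-forward cost drops accordingly, which is where the efficiency gain over a vanilla Transformer of the same depth comes from.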
