High Performance Natural Language Processing

Scale has played a central role in the rapid progress natural language processing has enjoyed in recent years. While benchmarks are dominated by ever larger models, efficient hardware use is critical for their widespread adoption and for further progress in the field. In this cutting-edge tutorial, we will recapitulate the state of the art in natural language processing with scale in perspective. After establishing these foundations, we will cover a wide range of techniques for improving efficiency, including knowledge distillation, quantization, pruning, and more efficient architectures, along with case studies and practical implementation tricks.
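As a concrete illustration of one technique named above, the sketch below shows a minimal knowledge-distillation objective in the style of Hinton et al. (2015): the student is trained against a temperature-softened copy of the teacher's predictions, blended with the usual hard-label cross-entropy. The PyTorch framing, the function name, and the temperature and alpha defaults are illustrative assumptions for this sketch, not an implementation taken from the tutorial itself.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        """Blend a soft-target KL term (teacher -> student) with standard
        hard-label cross-entropy. Temperature and alpha are illustrative
        defaults, not values prescribed by the tutorial."""
        # Soften both distributions with the temperature before comparing them.
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        # KL divergence between the softened distributions; the T^2 factor
        # keeps gradient magnitudes comparable across temperatures.
        kd_term = F.kl_div(log_soft_student, soft_teacher,
                           reduction="batchmean") * temperature ** 2
        # Standard cross-entropy against the ground-truth labels.
        ce_term = F.cross_entropy(student_logits, labels)
        return alpha * kd_term + (1.0 - alpha) * ce_term

In a typical setup, teacher_logits come from a frozen large model run in inference mode, while student_logits come from the smaller model being trained; only the student's parameters receive gradients from this loss.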
