The NLP Cookbook: Modern Recipes for Transformer Based Deep Learning Architectures

In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic and semantic tasks such as text classification, machine translation, cognitive dialogue systems, and information retrieval via Natural Language Understanding (NLU) and Natural Language Generation (NLG). This feat is primarily attributed to the seminal Transformer architecture, which has led to designs such as BERT, GPT (I, II, III), and others. Although these large models achieve unprecedented performance, they come at high computational cost. Consequently, several recent NLP architectures employ transfer learning, pruning, quantization, and knowledge distillation to reach moderate model sizes while retaining performance close to that of their predecessors. Additionally, to mitigate the data-size challenge that language models pose from a knowledge-extraction perspective, knowledge retrievers have been built to extract explicit documents from large corpora with greater efficiency and accuracy. Recent research has also focused on superior inference through efficient attention over longer input sequences. In this paper, we summarize and examine the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal performance and efficiency. We provide a detailed understanding of the functioning of the different architectures, a taxonomy of NLP designs, comparative evaluations, and future directions in NLP.
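
As a minimal illustration of the knowledge-distillation recipe mentioned above, the sketch below blends a soft-target KL term (student mimicking a teacher's temperature-scaled distribution) with the usual cross-entropy on hard labels. The temperature `T`, mixing weight `alpha`, and tensor shapes are illustrative assumptions, not the exact configuration of any specific distilled model surveyed here.

```python
# Hedged sketch of a knowledge-distillation loss (PyTorch assumed); names are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a KL term on softened logits with cross-entropy on ground-truth labels."""
    # Soft targets: temperature-scaled probabilities from the (frozen) teacher.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence, scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Standard supervised loss on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Example usage with random tensors standing in for model outputs (batch of 8, 10 classes).
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```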
