Efficient Contextual Representation Learning With Continuous Outputs

Contextual representation models have achieved great success in improving various downstream natural language processing tasks. However, these language-model-based encoders are difficult to train due to their large parameter sizes and high computational complexity. By carefully examining the training procedure, we observe that the softmax layer, which predicts a distribution over the vocabulary for the target word, often induces significant overhead, especially when the vocabulary is large. We therefore revisit the design of the output layer and consider directly predicting the pre-trained embedding of the target word for a given context. When applied to ELMo, the proposed approach achieves a 4-fold speedup and eliminates 80% of the trainable parameters while achieving competitive performance on downstream tasks. Further analysis shows that the approach maintains its speed advantage under various settings, even when the sentence encoder is scaled up.
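
To make the continuous-output idea concrete, the sketch below replaces the usual softmax classifier over the vocabulary with a layer that projects the encoder's contextual hidden state and regresses it onto a frozen, pre-trained embedding of the target word. The encoder interface (its `output_dim` attribute), the source of the pre-trained embeddings, and the cosine-distance objective are assumptions for illustration and may differ from the paper's exact configuration.

```python
# Minimal sketch (PyTorch) of a continuous output layer: instead of a softmax
# over a large vocabulary, project the contextual hidden state and regress it
# onto a frozen pre-trained embedding of the target word.
# The encoder, the embedding source, and the cosine loss are illustrative
# assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContinuousOutputLM(nn.Module):
    def __init__(self, encoder: nn.Module, pretrained_embeddings: torch.Tensor):
        super().__init__()
        self.encoder = encoder  # any sentence encoder producing (batch, seq, hidden)
        # Target-word embeddings are fixed, so they add no trainable parameters.
        self.target_embeddings = nn.Embedding.from_pretrained(
            pretrained_embeddings, freeze=True
        )
        hidden_dim = encoder.output_dim  # hypothetical attribute of the encoder
        embed_dim = pretrained_embeddings.size(1)
        # Small projection from the encoder's hidden size to the embedding size.
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, token_ids: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(token_ids)             # (batch, seq, hidden)
        predicted = self.proj(hidden)                # (batch, seq, embed_dim)
        target = self.target_embeddings(target_ids)  # (batch, seq, embed_dim)
        # Cosine-distance loss: no normalization over the vocabulary is needed,
        # which removes the softmax overhead when the vocabulary is large.
        loss = 1.0 - F.cosine_similarity(predicted, target, dim=-1)
        return loss.mean()
```

Because the target embeddings are frozen, the output side contributes only the small projection matrix to the trainable parameter count, and the per-step cost no longer scales with the vocabulary size, which is consistent with the speedup and parameter savings described above.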
