Deep Contextualized Word Representations

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment, and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
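As a concrete illustration of "learned functions of the internal states," the combination can be as simple as a softmax-normalized weighted sum of the hidden states from each biLM layer, scaled by a single task-specific scalar, with the weights trained jointly with the downstream model. The PyTorch sketch below shows one minimal version of such a layer-mixing module under that assumption; the class name LayerMix and the parameter names are ours for illustration, not identifiers from the paper or its released code.

```python
import torch
import torch.nn as nn


class LayerMix(nn.Module):
    """Collapse per-layer biLM states into one vector per token via a
    learned, softmax-normalized weighted sum plus a global scale."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One unnormalized weight per biLM layer (the s_j before the softmax).
        self.scalars = nn.Parameter(torch.zeros(num_layers))
        # Task-specific scale applied to the mixed representation.
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states: list[torch.Tensor]) -> torch.Tensor:
        # layer_states: one (batch, seq_len, dim) tensor per biLM layer.
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_states))
        return self.gamma * mixed


# Toy usage: a 3-layer biLM over a batch of 2 sentences, 5 tokens, 16 dims.
states = [torch.randn(2, 5, 16) for _ in range(3)]
token_vectors = LayerMix(num_layers=3)(states)  # shape: (2, 5, 16)
```

Because the mixing weights are learned per task, different downstream models can emphasize different layers, which is consistent with the paper's finding that exposing all of the biLM's internal layers, rather than only the top one, is what makes the representations broadly useful.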
