Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Much of the recent success in natural language processing (NLP) has been driven by distributed vector representations of words trained on large amounts of text in an unsupervised manner. These representations are typically used as general-purpose features for words across a range of NLP problems. However, extending this success to learning representations of sequences of words, such as sentences, remains an open problem. Recent work has explored unsupervised as well as supervised learning techniques with different training objectives to learn general-purpose fixed-length sentence representations. In this work, we present a simple, effective multi-task learning framework for sentence representations that combines the inductive biases of diverse training objectives in a single model. We train this model on several data sources with multiple training objectives on over 100 million sentences. Extensive experiments demonstrate that sharing a single recurrent sentence encoder across weakly related tasks leads to consistent improvements over previous methods. Our learned general-purpose representations yield substantial improvements in transfer learning and low-resource settings.
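To make the shared-encoder idea concrete, here is a minimal PyTorch sketch of one recurrent sentence encoder reused by several task-specific heads, with a task sampled per mini-batch. The module names, pooling choice, task set, and hyperparameters are illustrative assumptions, not the paper's exact architecture or training recipe.

```python
# Minimal multi-task sketch: a shared bidirectional GRU sentence encoder
# with one lightweight head per training objective. All names and sizes
# are illustrative, not the paper's exact configuration.
import random
import torch
import torch.nn as nn


class SharedSentenceEncoder(nn.Module):
    """Encodes a batch of token-id sequences into fixed-length vectors."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> sentence vector: (batch, 2 * hidden_dim)
        states, _ = self.rnn(self.embed(token_ids))
        return states.max(dim=1).values  # max-pool over time


class MultiTaskModel(nn.Module):
    """One shared encoder, one classification head per objective."""

    def __init__(self, vocab_size, num_classes_per_task):
        super().__init__()
        self.encoder = SharedSentenceEncoder(vocab_size)
        self.heads = nn.ModuleDict({
            task: nn.Linear(1024, n_cls)  # 1024 = 2 * hidden_dim
            for task, n_cls in num_classes_per_task.items()
        })

    def forward(self, task, token_ids):
        return self.heads[task](self.encoder(token_ids))


# Toy training loop: sample an objective per mini-batch; gradients from
# every task update the same encoder parameters.
tasks = {"nli": 3, "parse_action": 100, "skip_thought_lm": 20000}
model = MultiTaskModel(vocab_size=30000, num_classes_per_task=tasks)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(3):
    task = random.choice(list(tasks))
    x = torch.randint(1, 30000, (8, 20))     # fake token ids
    y = torch.randint(0, tasks[task], (8,))  # fake labels
    loss = loss_fn(model(task, x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After training, the encoder alone can be frozen and used to produce fixed-length features for downstream transfer tasks; the per-task heads are discarded.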
