Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations

We extend the work of Wieting et al. (2017), back-translating a large parallel corpus to produce a dataset of more than 51 million English-English sentential paraphrase pairs in a dataset we call ParaNMT-50M. We find this corpus to be cover many domains and styles of text, in addition to being rich in paraphrases with different sentence structure, and we release it to the community. To show its utility, we use it to train paraphrastic sentence embeddings using only minor changes to the framework of Wieting et al. (2016b). The resulting embeddings outperform all supervised systems on every SemEval semantic textual similarity (STS) competition, and are a significant improvement in capturing paraphrastic similarity over all other sentence embeddings We also show that our embeddings perform competitively on general sentence embedding tasks. We release the corpus, pretrained sentence embeddings, and code to generate them. We believe the corpus can be a valuable resource for automatic paraphrase generation and can provide a rich source of semantic information to improve downstream natural language understanding tasks.

[1]  Sanjeev Arora,et al.  A Simple but Tough-to-Beat Baseline for Sentence Embeddings , 2017, ICLR.

[2]  Kevin Gimpel,et al.  Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext , 2017, EMNLP.

[3]  Walter S. Lasecki,et al.  Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection , 2017, ACL.

[4]  Christopher D. Manning,et al.  Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks , 2015, ACL.

[5]  Angeliki Lazaridou,et al.  Jointly optimizing word representations for lexical and sentential tasks with the C-PHRASE model , 2015, ACL.

[6]  Chris Callison-Burch,et al.  Paraphrasing with Bilingual Parallel Corpora , 2005, ACL.

[7]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[8]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[9]  Ondrej Dusek,et al.  CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered , 2016, TSD.

[10]  Regina Barzilay,et al.  Extracting Paraphrases from a Parallel Corpus , 2001, ACL.

[11]  Eneko Agirre,et al.  SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity , 2012, *SEMEVAL.

[12]  Bo Pang,et al.  A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts , 2004, ACL.

[13]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[14]  Christopher Potts,et al.  A large annotated corpus for learning natural language inference , 2015, EMNLP.

[15]  Mamoru Komachi,et al.  Building a Non-Trivial Paraphrase Corpus Using Multiple Machine Translation Systems , 2017, ACL.

[16]  Holger Schwenk,et al.  Supervised Learning of Universal Sentence Representations from Natural Language Inference Data , 2017, EMNLP.

[17]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[18]  Alon Lavie,et al.  Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[19]  Kevin Gimpel,et al.  From Paraphrase Database to Compositional Paraphrase Model and Back , 2015, Transactions of the Association for Computational Linguistics.

[20]  M. Marelli,et al.  SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment , 2014, *SEMEVAL.

[21]  Eneko Agirre,et al.  SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation , 2017, *SEMEVAL.

[22]  Hua He,et al.  A Continuously Growing Dataset of Sentential Paraphrases , 2017, EMNLP.

[23]  Christopher Potts,et al.  Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.

[24]  Larry P. Heck,et al.  Learning deep structured semantic models for web search using clickthrough data , 2013, CIKM.

[25]  K. Pala,et al.  Text, Speech and Dialogue: 19th International Conference TSD 2016, Brno, Czech Republic, September 12-16, 2016 , 2016 .

[26]  Ilya Sutskever,et al.  Learning to Generate Reviews and Discovering Sentiment , 2017, ArXiv.

[27]  Mirella Lapata,et al.  Paraphrasing Revisited with Neural Machine Translation , 2017, EACL.

[28]  Kevin Gimpel,et al.  Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings , 2017, ACL.

[29]  Rico Sennrich,et al.  Nematus: a Toolkit for Neural Machine Translation , 2017, EACL.

[30]  Claire Cardie,et al.  Annotating Expressions of Opinions and Emotions in Language , 2005, Lang. Resour. Evaluation.

[31]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[32]  Eneko Agirre,et al.  SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation , 2016, *SEMEVAL.

[33]  Felix Hill,et al.  SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation , 2014, CL.

[34]  John Salvatier,et al.  Theano: A Python framework for fast computation of mathematical expressions , 2016, ArXiv.

[35]  Dan Roth,et al.  Learning Question Classifiers , 2002, COLING.

[36]  Chris Callison-Burch,et al.  PPDB: The Paraphrase Database , 2013, NAACL.

[37]  Chris Brockett,et al.  Automatically Constructing a Corpus of Sentential Paraphrases , 2005, IJCNLP.

[38]  Peng Zhou,et al.  Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling , 2016, COLING.

[39]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[40]  Chris Callison-Burch,et al.  The Multilingual Paraphrase Database , 2014, LREC.

[41]  Jacob Eisenstein,et al.  Discriminative Improvements to Distributional Sentence Similarity , 2013, EMNLP.

[42]  Han Zhao,et al.  Self-Adaptive Hierarchical Sentence Model , 2015, IJCAI.

[43]  David Kauchak,et al.  Simple English Wikipedia: A New Text Simplification Task , 2011, ACL.

[44]  Patrick Pantel,et al.  Discovery of inference rules for question-answering , 2001, Natural Language Engineering.

[45]  Zhe Gan,et al.  Learning Generic Sentence Representations Using Convolutional Neural Networks , 2016, EMNLP.

[46]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[47]  Alice Lai,et al.  Illinois-LH: A Denotational and Distributional Approach to Semantics , 2014, *SEMEVAL.

[48]  Claire Cardie,et al.  SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability , 2015, *SEMEVAL.

[49]  Samuel R. Bowman,et al.  A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , 2017, NAACL.

[50]  Felix Hill,et al.  Learning Distributed Representations of Sentences from Unlabelled Data , 2016, NAACL.

[51]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[52]  Yoshua Bengio,et al.  Not All Neural Embeddings are Born Equal , 2014, ArXiv.

[53]  David Vandyke,et al.  Counter-fitting Word Vectors to Linguistic Constraints , 2016, NAACL.

[54]  Claire Cardie,et al.  SemEval-2014 Task 10: Multilingual Semantic Textual Similarity , 2014, *SEMEVAL.

[55]  Yoshua Bengio,et al.  Embedding Word Similarity with Neural Machine Translation , 2014, ICLR.

[56]  Kevin Gimpel,et al.  Towards Universal Paraphrastic Sentence Embeddings , 2015, ICLR.

[57]  Chris Callison-Burch,et al.  SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT) , 2015, *SEMEVAL.

[58]  Kevin Gimpel,et al.  Charagram: Embedding Words and Sentences via Character n-grams , 2016, EMNLP.

[59]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[60]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[61]  Chris Quirk,et al.  Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources , 2004, COLING.

[62]  Luke S. Zettlemoyer,et al.  Adversarial Example Generation with Syntactically Controlled Paraphrase Networks , 2018, NAACL.

[63]  Richard Socher,et al.  Learned in Translation: Contextualized Word Vectors , 2017, NIPS.

[64]  Bing Liu,et al.  Mining and summarizing customer reviews , 2004, KDD.

[65]  Chris Callison-Burch,et al.  PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification , 2015, ACL.

[66]  Chris Quirk,et al.  Monolingual Machine Translation for Paraphrase Generation , 2004, EMNLP.

[67]  Yang Shao,et al.  HCTI at SemEval-2017 Task 1: Use convolutional neural network to evaluate Semantic Textual Similarity , 2017, SemEval@ACL.

[68]  Jeffrey Pennington,et al.  Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection , 2011, NIPS.

[69]  Bo Pang,et al.  Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales , 2005, ACL.

[70]  Chris Callison-Burch,et al.  Extracting Lexically Divergent Paraphrases from Twitter , 2014, TACL.

[71]  Eneko Agirre,et al.  *SEM 2013 shared task: Semantic Textual Similarity , 2013, *SEMEVAL.

[72]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[73]  Matteo Pagliardini,et al.  Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features , 2017, NAACL.