Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are generated automatically via back-translation and rely on beam search, which tends to produce outputs that lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while preserving meaning, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.
