ParaSCI: A Large Scientific Paraphrase Dataset for Longer Paraphrase Generation

We propose ParaSCI, the first large-scale paraphrase dataset in the scientific field, including 33,981 paraphrase pairs from ACL (ParaSCI-ACL) and 316,063 pairs from arXiv (ParaSCI-arXiv). Digging into characteristics and common patterns of scientific papers, we construct this dataset though intra-paper and inter-paper methods, such as collecting citations to the same paper or aggregating definitions by scientific terms. To take advantage of sentences paraphrased partially, we put up PDBERT as a general paraphrase discovering method. The major advantages of paraphrases in ParaSCI lie in the prominent length and textual diversity, which is complementary to existing paraphrase datasets. ParaSCI obtains satisfactory results on human evaluation and downstream tasks, especially long paraphrase generation.

[1]  Kyle Lo,et al.  S2ORC: The Semantic Scholar Open Research Corpus , 2020, ACL.

[2]  Ron Artstein,et al.  Survey Article: Inter-Coder Agreement for Computational Linguistics , 2008, CL.

[3]  Kilian Q. Weinberger,et al.  BERTScore: Evaluating Text Generation with BERT , 2019, ICLR.

[4]  Chris Callison-Burch,et al.  SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT) , 2015, *SEMEVAL.

[5]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[6]  Rashmi Prasad,et al.  The Penn Discourse Treebank , 2004, LREC.

[7]  Chris Callison-Burch,et al.  PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification , 2015, ACL.

[8]  Chutima Boonthum iSTART: Paraphrase Recognition , 2004, ACL 2004.

[9]  Mirella Lapata,et al.  Paraphrasing Revisited with Neural Machine Translation , 2017, EACL.

[10]  Xiaodong Liu,et al.  Stochastic Answer Networks for Machine Reading Comprehension , 2017, ACL.

[11]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[12]  Hua He,et al.  A Continuously Growing Dataset of Sentential Paraphrases , 2017, EMNLP.

[13]  Eduard H. Hovy,et al.  Squibs: What Is a Paraphrase? , 2013, CL.

[14]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[15]  Xifeng Yan,et al.  Cross-domain Semantic Parsing via Paraphrasing , 2017, EMNLP.

[16]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[17]  Chris Brockett,et al.  Automatically Constructing a Corpus of Sentential Paraphrases , 2005, IJCNLP.

[18]  Ankush Gupta,et al.  A Deep Generative Framework for Paraphrase Generation , 2017, AAAI.

[19]  Yansong Feng,et al.  Paraphrase Generation with Latent Bag of Words , 2020, NeurIPS.

[20]  Mirella Lapata,et al.  Learning to Paraphrase for Question Answering , 2017, EMNLP.

[21]  Kevin Gimpel,et al.  Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations , 2017, ArXiv.

[22]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.