BiSECT: Learning to Split and Rephrase Sentences with Bitexts

An important task in NLP applications such as sentence simplification is to take a long, complex sentence and split it into shorter sentences, rephrasing as necessary. We introduce a novel dataset and a new model for this ‘split and rephrase’ task. Our BiSECT training data consists of 1 million long English sentences paired with shorter, meaning-equivalent English sentences. We obtain these by extracting 1-2 sentence alignments in bilingual parallel corpora and then using machine translation to convert both sides of the corpus into the same language. BiSECT contains higher-quality training examples than previous Split and Rephrase corpora, with sentence splits that require more significant modifications. We categorize examples in our corpus and use these categories in a novel model that allows us to target specific regions of the input sentence to be split and edited. Moreover, we show that models trained on BiSECT can perform a wider variety of split operations and improve upon previous state-of-the-art approaches in automatic and human evaluations.
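
The data-collection idea described above (keep 1-2 sentence alignments from a bitext, then translate the foreign side into English) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' pipeline: the function name `extract_split_pairs`, the `(English sentences, foreign sentences)` input layout, and the toy lookup-table "translator" are all assumptions made for the example. A real setup would plug in an actual sentence aligner and machine translation system.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed input layout: each alignment groups the English sentences and the
# foreign-language sentences that a sentence aligner judged to be equivalent.
Alignment = Tuple[List[str], List[str]]
# Output: one long English sentence paired with its two-sentence English split.
SplitPair = Tuple[str, List[str]]


def extract_split_pairs(
    alignments: Iterable[Alignment],
    translate: Callable[[str], str],
) -> List[SplitPair]:
    """Keep 1-2 and 2-1 alignments and render both sides in English.

    `translate` is a hypothetical stand-in for a machine translation system
    that maps one foreign sentence to one English sentence.
    """
    pairs: List[SplitPair] = []
    for en_sents, foreign_sents in alignments:
        sizes = (len(en_sents), len(foreign_sents))
        if sizes == (1, 2):
            # One long English sentence aligned to two foreign sentences.
            long_sentence = en_sents[0]
            split = [translate(s) for s in foreign_sents]
        elif sizes == (2, 1):
            # Two English sentences aligned to one long foreign sentence.
            long_sentence = translate(foreign_sents[0])
            split = list(en_sents)
        else:
            # 1-1, 2-2, and other alignments give no split example.
            continue
        pairs.append((long_sentence, split))
    return pairs


# Toy "translator": a lookup table standing in for a real MT system.
TOY_MT = {
    "Le comité a approuvé le budget.": "The committee approved the budget.",
    "Il avait été révisé deux fois.": "It had been revised twice.",
    "Elle est arrivée en retard, alors que la réunion avait déjà commencé.":
        "She arrived late, when the meeting had already begun.",
}

toy_alignments: List[Alignment] = [
    (["The committee approved the budget, which had been revised twice."],
     ["Le comité a approuvé le budget.", "Il avait été révisé deux fois."]),
    (["He left."], ["Il est parti."]),  # 1-1 alignment, skipped
    (["She arrived late.", "The meeting had already begun."],
     ["Elle est arrivée en retard, alors que la réunion avait déjà commencé."]),
]

toy_pairs = extract_split_pairs(toy_alignments, lambda s: TOY_MT[s])
```

In this sketch the single-sentence side always becomes the long input and the two-sentence side becomes the target split, regardless of which language each side started in.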
