Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders

Text simplification (TS) rephrases long sentences into simpler variants while preserving their meaning. Traditional sequence-to-sequence models rely heavily on the quantity and quality of parallel sentences, which limits their applicability across languages and domains. This work investigates how to leverage large amounts of unpaired corpora for the TS task. We adopt the back-translation architecture from unsupervised neural machine translation (NMT), including denoising autoencoders for language modeling and automatic generation of parallel data through iterative back-translation. However, it is non-trivial to generate appropriate complex-simple pairs if we directly treat the sets of simple and complex corpora as two different languages: the two types of sentences are quite similar, and it is hard for the model to capture the characteristics of each. To tackle this problem, we propose asymmetric denoising methods for sentences of different complexity levels. When modeling simple and complex sentences with autoencoders, we introduce different types of noise into the training process. This method significantly improves simplification performance. Our model can be trained in both unsupervised and semi-supervised manners. Automatic and human evaluations show that our unsupervised model outperforms previous systems, and that with limited supervision our model performs competitively with multiple state-of-the-art simplification systems.
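As a rough illustration of the asymmetric-denoising idea (a minimal sketch, not the paper's exact recipe), the code below contrasts the symmetric drop-and-shuffle noise commonly used in unsupervised NMT with asymmetric variants: the simple-side autoencoder additionally sees substitution noise drawn from a simple-to-complex paraphrase table, while the complex side receives heavier corruption. The specific noise types, rates, and the `paraphrase_table` resource are illustrative assumptions, not the authors' published configuration.

```python
import random

def drop_and_shuffle(tokens, p_drop=0.1, k=3):
    # Symmetric denoising noise in the style of unsupervised NMT:
    # randomly drop words, then apply a bounded local shuffle in
    # which each surviving token moves at most ~k positions.
    kept = [t for t in tokens if random.random() > p_drop]
    if not kept:  # never return an empty sentence
        kept = tokens[:1]
    keys = [i + random.uniform(0, k + 1) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

def noise_simple(tokens, paraphrase_table, p_sub=0.3):
    # Hypothetical noise for the SIMPLE-side autoencoder: substitute
    # simple words with harder paraphrases so the decoder learns to
    # map complex-leaning input back to simple output.
    # `paraphrase_table` (simple word -> list of complex variants)
    # is an assumed resource, e.g. built from a paraphrase database.
    noised = [random.choice(paraphrase_table[t])
              if t in paraphrase_table and random.random() < p_sub
              else t
              for t in tokens]
    return drop_and_shuffle(noised, p_drop=0.05, k=2)

def noise_complex(tokens):
    # Heavier drop/shuffle on the COMPLEX side: the decoder must
    # recover long sentences from more aggressively corrupted input.
    return drop_and_shuffle(tokens, p_drop=0.2, k=4)

# Usage example with a toy paraphrase table.
table = {"big": ["substantial", "considerable"],
         "ran": ["proceeded", "hastened"]}
print(noise_simple("the big dog ran home".split(), table))
print(noise_complex("the committee convened to deliberate".split()))
```

In training, each autoencoder would reconstruct clean sentences from its own noised input, while iterative back-translation generates synthetic complex-simple pairs; the asymmetry above is what pushes the two decoders toward distinct output styles rather than near-copies of each other.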
