Neural Text Simplification in Low-Resource Conditions Using Weak Supervision

Neural text simplification has gained increasing attention in the NLP community thanks to recent advancements in deep sequence-to-sequence learning. Most recent efforts with such a data-demanding paradigm have dealt with the English language, for which sizeable training datasets are currently available to deploy competitive models. Similar improvements on less resource-rich languages are conditioned either on intensive manual work to create training data, or on the design of effective automatic generation techniques to bypass the data acquisition bottleneck. Inspired by the machine translation field, in which synthetic parallel pairs generated from monolingual data yield significant improvements to neural models, in this paper we exploit large amounts of heterogeneous data to automatically select simple sentences, which are then used to create synthetic simplification pairs. We also evaluate other solutions, such as oversampling and the use of external word embeddings fed to the neural simplification system. Our approach is evaluated on Italian and Spanish, for which only a few thousand gold sentence pairs are available. The results show that these techniques yield performance improvements over a baseline sequence-to-sequence configuration.
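The weak-supervision idea sketched in the abstract can be illustrated with a toy filter. The sketch below is a hypothetical illustration, not the authors' actual pipeline: it selects "simple" sentences from a monolingual pool using crude surface readability heuristics (sentence length and average token length are assumptions, not the paper's criteria); such sentences could then serve as targets when building synthetic complex-to-simple training pairs.

```python
def is_simple(sentence, max_tokens=15, max_avg_token_len=6.0):
    """Crude readability filter: keep short sentences made of short words.

    The thresholds are illustrative defaults, not values from the paper.
    """
    tokens = sentence.split()
    if not tokens or len(tokens) > max_tokens:
        return False
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    return avg_len <= max_avg_token_len


def select_simple(corpus):
    """Keep only sentences passing the readability filter."""
    return [s for s in corpus if is_simple(s)]


corpus = [
    "The cat sat on the mat.",
    "Notwithstanding the aforementioned considerations, the committee "
    "deliberated extensively.",
]
# Only the first, lexically simpler sentence survives the filter.
print(select_simple(corpus))
```

In a real system the surviving sentences would be paired with automatically generated complex counterparts (in the spirit of back-translation in machine translation) to form the synthetic training pairs the abstract describes.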
