Simple English Wikipedia: A New Text Simplification Task

In this paper we examine the task of sentence simplification, which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations, including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.
