BLEU is Not Suitable for the Evaluation of Text Simplification

BLEU is widely considered to be an informative metric for text-to-text generation, including Text Simplification (TS). TS includes both lexical and structural aspects. In this paper we show that BLEU is not suitable for the evaluation of sentence splitting, the major structural simplification operation. We manually compiled a sentence splitting gold standard corpus containing multiple structural paraphrases, and performed a correlation analysis with human judgments. We find low or no correlation between BLEU and the grammaticality and meaning preservation parameters where sentence splitting is involved. Moreover, BLEU often negatively correlates with simplicity, essentially penalizing simpler sentences.
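
As a rough illustration of the kind of correlation analysis described above, the following minimal sketch scores system outputs with a simplified sentence-level BLEU and correlates those scores with human ratings using Spearman's rank correlation. Everything in it is an assumption made for illustration: the add-one smoothing, the choice of Spearman's coefficient, the function names `bleu` and `spearman`, and the toy sentences and ratings are hypothetical and are not taken from the paper's corpus or experimental setup.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU against multiple references.

    Uses add-one smoothing on each n-gram precision (an illustrative
    assumption, not necessarily the configuration used in the paper).
    """
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Maximum count of each n-gram over all references (for clipping).
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((clipped + 1) / (total + 1)) / max_n
    # Brevity penalty, using the reference length closest to the candidate.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    if len(cand) >= ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - ref_len / max(len(cand), 1))
    return penalty * math.exp(log_precision)


def ranks(values):
    """Assign 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation (Pearson correlation of the ranks).

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical toy data: two split references, three system outputs,
    # and made-up human simplicity ratings for those outputs.
    references = [
        "the man left . he was tired .",
        "the man was tired . he left .",
    ]
    outputs = [
        "the man , who was tired , left .",
        "the man left . he was tired .",
        "the man left .",
    ]
    human_simplicity = [1.0, 4.0, 3.0]
    scores = [bleu(o, references) for o in outputs]
    print(scores, spearman(scores, human_simplicity))
```

A low or negative coefficient from such an analysis would indicate that the metric does not track the human judgment in question. The toy ratings here are invented, so the printed value carries no empirical weight.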
