LENS: A Learnable Evaluation Metric for Text Simplification

Training learnable metrics using modern language models has recently emerged as a promising method for the automatic evaluation of machine translation. However, existing human evaluation datasets for text simplification have limited annotations that are based on unitary or outdated models, making them unsuitable for this approach. To address these issues, we introduce the SimpEval corpus, which contains two parts: SimpEval_past, comprising 12K human ratings on 2.4K simplifications of 24 past systems, and SimpEval_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications, including GPT-3.5-generated text. We then present LENS, a Learnable Evaluation Metric for Text Simplification, trained on SimpEval. Extensive empirical results show that LENS correlates much better with human judgment than existing metrics, paving the way for future progress in the evaluation of text simplification. We also introduce Rank & Rate, a human evaluation framework that rates simplifications from several models in a list-wise manner using an interactive interface. It ensures both consistency and accuracy in the evaluation process and was used to create the SimpEval datasets.
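As a rough illustration of what "correlates with human judgment" means for an evaluation metric, the sketch below ranks a handful of simplifications by human rating and by metric score, then measures how often the two rankings agree. It assumes Kendall's tau-a as the correlation statistic, which the abstract does not specify. The function name `kendall_tau` and all numbers are hypothetical and are not taken from SimpEval or LENS.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both sequences order this pair the same way
        elif product < 0:
            discordant += 1   # the sequences disagree on this pair
        # ties (product == 0) count toward neither total
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: five simplifications with one human rating each
# and scores from two made-up metrics.
human_ratings = [80, 35, 60, 20, 95]
metric_a = [0.71, 0.30, 0.55, 0.25, 0.90]  # orders the outputs like the humans do
metric_b = [0.40, 0.45, 0.35, 0.50, 0.42]  # mostly disagrees with the humans

print("metric_a tau:", kendall_tau(human_ratings, metric_a))
print("metric_b tau:", kendall_tau(human_ratings, metric_b))
```

A metric whose tau is closer to 1 orders the outputs more like the human raters do. In practice a library routine that also handles ties and significance testing would likely be preferable to this hand-rolled version.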
