SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation

Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially the case for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 quality dimensions: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness, covering 6 languages, 9 systems, and 4 datasets. As a result of its size and scope, SEAHORSE can serve both as a benchmark for evaluating learnt metrics and as a large-scale resource for training such metrics. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE (Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make SEAHORSE publicly available for future research on multilingual and multifaceted summarization evaluation.
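
To make the annotation structure described above concrete, below is a minimal Python sketch of a SEAHORSE-style rating record and a per-dimension aggregation. It assumes binary ratings for simplicity, and the names it uses (SeahorseRecord, lang, system_a, xlsum, and so on) are illustrative placeholders rather than the dataset's official schema.

from dataclasses import dataclass
from typing import Dict, List

# The six SEAHORSE quality dimensions, each rated per summary by human annotators.
QUALITY_DIMENSIONS = [
    "comprehensibility", "repetition", "grammar",
    "attribution", "main_ideas", "conciseness",
]


@dataclass
class SeahorseRecord:
    """One human-rated summary (field names are illustrative, not the official schema)."""
    article: str           # source document
    summary: str           # system- or human-generated summary
    lang: str              # one of the 6 covered languages, e.g. "de"
    system: str            # one of the 9 summarization systems
    dataset: str           # one of the 4 underlying summarization datasets
    comprehensibility: bool
    repetition: bool       # True = no problematic repetition
    grammar: bool
    attribution: bool      # summary is fully supported by the source article
    main_ideas: bool       # summary captures the source's main ideas
    conciseness: bool


def positive_rate_per_dimension(records: List[SeahorseRecord]) -> Dict[str, float]:
    """Fraction of summaries rated positively along each quality dimension."""
    if not records:
        return {dim: 0.0 for dim in QUALITY_DIMENSIONS}
    totals = {dim: 0 for dim in QUALITY_DIMENSIONS}
    for rec in records:
        for dim in QUALITY_DIMENSIONS:
            totals[dim] += int(getattr(rec, dim))
    return {dim: totals[dim] / len(records) for dim in QUALITY_DIMENSIONS}


if __name__ == "__main__":
    demo = [
        SeahorseRecord("source text ...", "a short summary.", "en",
                       "system_a", "xlsum", True, True, True, False, True, True),
        SeahorseRecord("autre texte ...", "un résumé court.", "fr",
                       "system_b", "mlsum", True, False, True, True, True, False),
    ]
    print(positive_rate_per_dimension(demo))

A trained metric would replace the human booleans with model-predicted scores per dimension, but the record layout stays the same.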

[1] Dan Iter, et al. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, 2023, EMNLP.

[2] A. Borji. A Categorical Archive of ChatGPT Failures, 2023, ArXiv.

[3] Mirella Lapata, et al. mFACE: Multilingual Summarization with Factual Consistency Evaluation, 2022, ACL.

[4] Shafiq R. Joty, et al. Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation, 2022, ACL.

[5] Y. Matias, et al. TRUE: Re-evaluating Factual Consistency Evaluation, 2022, NAACL.

[6] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.

[7] Marc van Zee, et al. Scaling Up Models and Data with t5x and seqio, 2022, ArXiv.

[8] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.

[9] Sebastian Gehrmann, et al. Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text, 2022, J. Artif. Intell. Res.

[10] Pascale Fung, et al. Survey of Hallucination in Natural Language Generation, 2022, ACM Comput. Surv.

[11] Gaurav Singh Tomar, et al. Measuring Attribution in Natural Language Generation Models, 2021, Computational Linguistics.

[12] Yejin Choi, et al. Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts, 2021, NAACL.

[13] Wenhao Liu, et al. DialFact: A Benchmark for Fact-Checking in Dialogue, 2021, ACL.

[14] Marcin Junczys-Dowmunt, et al. To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation, 2021, WMT.

[15] Rifat Shahriyar, et al. XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages, 2021, FINDINGS.

[16] David Reitter, et al. Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark, 2021, Transactions of the Association for Computational Linguistics.

[17] Markus Freitag, et al. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation, 2021, Transactions of the Association for Computational Linguistics.

[18] Artidoro Pagnoni, et al. Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics, 2021, NAACL.

[19] Idan Szpektor, et al. Q²: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering, 2021, EMNLP.

[20] Regina Barzilay, et al. Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence, 2021, NAACL.

[21] Diyi Yang, et al. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics, 2021, GEM.

[22] Mona T. Diab, et al. Detecting Hallucinated Content in Conditional Neural Sequence Generation, 2020, FINDINGS.

[23] Colin Raffel, et al. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, 2020, NAACL.

[24] Graham Neubig, et al. Re-evaluating Evaluation in Text Summarization, 2020, EMNLP.

[25] Claire Cardie, et al. WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization, 2020, FINDINGS.

[26] Alon Lavie, et al. COMET: A Neural Framework for MT Evaluation, 2020, EMNLP.

[27] Ryan J. Lowe, et al. Learning to summarize from human feedback, 2020, NeurIPS.

[28] Dragomir R. Radev, et al. SummEval: Re-evaluating Summarization Evaluation, 2020, Transactions of the Association for Computational Linguistics.

[29] Ryan McDonald, et al. On Faithfulness and Factuality in Abstractive Summarization, 2020, ACL.

[30] Sylvain Lamprier, et al. MLSUM: The Multilingual Summarization Corpus, 2020, EMNLP.

[31] Diyi Yang, et al. ToTTo: A Controlled Table-To-Text Generation Dataset, 2020, EMNLP.

[32] Thibault Sellam, et al. BLEURT: Learning Robust Metrics for Text Generation, 2020, ACL.

[33] Alex Wang, et al. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries, 2020, ACL.

[34] Richard Socher, et al. Evaluating the Factual Consistency of Abstractive Text Summarization, 2019, EMNLP.

[35] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[36] Ankur P. Parikh, et al. Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation, 2019, ArXiv.

[37] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.

[38] Jason Baldridge, et al. PAWS: Paraphrase Adversaries from Word Scrambling, 2019, NAACL.

[39] Guillaume Lample, et al. XNLI: Evaluating Cross-lingual Sentence Representations, 2018, EMNLP.

[40] Mirella Lapata, et al. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization, 2018, EMNLP.

[41] Peter Clark, et al. SciTaiL: A Textual Entailment Dataset from Science Question Answering, 2018, AAAI.

[42] Andreas Vlachos, et al. FEVER: a Large-scale Dataset for Fact Extraction and VERification, 2018, NAACL.

[43] Benno Stein, et al. TL;DR: Mining Reddit to Learn Automatic Summarization, 2017, NFiS@EMNLP.

[44] Alexander M. Rush, et al. Challenges in Data-to-Document Generation, 2017, EMNLP.

[45] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.

[46] Ondrej Bojar, et al. Results of the WMT16 Metrics Shared Task, 2016, WMT.

[47] Christopher Potts, et al. A large annotated corpus for learning natural language inference, 2015, EMNLP.

[48] Phil Blunsom, et al. Teaching Machines to Read and Comprehend, 2015, NIPS.

[49] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.

[50] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[51] A. Lavie, et al. Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain, 2021, WMT.

[52] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[53] Walter Daelemans, et al. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, EMNLP.

[54] Yaroslav Fyodorov, et al. A Natural Logic Inference System, 2000.