Scarecrow: A Framework for Scrutinizing Machine Text

Modern neural text generation systems can produce remarkably fluent and grammatical text. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures. To facilitate research on these complex error types, we introduce a new structured, crowdsourced error annotation schema called SCARECROW. The error categories used in SCARECROW (such as redundancy, commonsense errors, and incoherence) were identified by combining expert analysis with several pilot rounds of ontology-free crowd annotation, arriving at a schema that covers the error phenomena found in real machine-generated text. We use SCARECROW to collect 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text, amounting to over 41k spans, each labeled with its error category, severity, a natural language explanation, and antecedent span (where relevant). We collect annotations for text generated by state-of-the-art systems with varying known performance levels, from GPT-2 Small through the largest GPT-3. We isolate several factors for detailed analysis, including parameter count, training data, and decoding technique. Our results show both expected and surprising differences across these settings. These findings demonstrate the value of SCARECROW annotations in assessing current and future text generation systems. We release our complete annotation toolkit and dataset at https://yao-dou.github.io/scarecrow/.
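To make the annotation unit concrete, below is a minimal sketch (in Python) of what a single SCARECROW span annotation might contain, based only on the fields named above: an error span, its category, a severity rating, a free-text explanation, and an optional antecedent span. The field names, types, and example values are illustrative assumptions, not the schema of the released dataset.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of one span-level annotation; fields mirror those described in the
# abstract, but names and types here are assumptions for illustration only.
@dataclass
class SpanAnnotation:
    start: int                      # character offset where the error span begins
    end: int                        # character offset where the error span ends
    category: str                   # e.g. "redundancy", "commonsense", "incoherence"
    severity: int                   # annotator-judged severity of the error
    explanation: str                # free-text rationale written by the annotator
    antecedent: Optional[tuple[int, int]] = None  # earlier span the error refers back to, if any

# Hypothetical example annotation on a generated paragraph.
example = SpanAnnotation(
    start=120,
    end=147,
    category="redundancy",
    severity=2,
    explanation="Repeats information already stated in the first sentence.",
    antecedent=(10, 37),
)
```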
