ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding

We introduce ZeroSCROLLS, a zero-shot benchmark for natural language understanding over long texts, which contains only test sets, without training or development data. We adapt six tasks from the SCROLLS benchmark and add four new datasets, including two novel information-fusing tasks, such as aggregating the percentage of positive reviews. Using ZeroSCROLLS, we conduct a comprehensive evaluation of both open-source and closed-source large language models, finding that Claude outperforms ChatGPT, and that GPT-4 achieves the highest average score. However, there is still room for improvement on multiple open challenges in ZeroSCROLLS, such as aggregation tasks, where models struggle to pass the naive baseline. As the state of the art is a moving target, we invite researchers to evaluate their ideas on the live ZeroSCROLLS leaderboard.
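To make the aggregation challenge concrete, the following is a minimal sketch of the kind of task described above. The task format, scoring function, and constant-guess baseline here are illustrative assumptions for exposition, not ZeroSCROLLS's actual specification.

```python
# A simplified stand-in for a review-aggregation task: given the gold
# sentiments of a hotel's reviews, the model must report what percentage
# of them are positive. The toy metric and the 50% constant baseline
# below are assumptions, not the benchmark's exact metric.

def percentage_positive(review_sentiments: list[bool]) -> float:
    """Gold answer: the share of positive reviews, as a percentage."""
    return 100.0 * sum(review_sentiments) / len(review_sentiments)

def score(prediction: float, gold: float) -> float:
    """Toy metric: 1 minus normalized absolute error (higher is better)."""
    return max(0.0, 1.0 - abs(prediction - gold) / 100.0)

# Example: 7 of 10 reviews are positive, so the gold answer is 70%.
gold = percentage_positive([True] * 7 + [False] * 3)

naive_prediction = 50.0  # a constant-guess "naive baseline"
model_prediction = 65.0  # a hypothetical model output

print(f"gold={gold:.0f}%  "
      f"naive={score(naive_prediction, gold):.2f}  "
      f"model={score(model_prediction, gold):.2f}")
```

Under this toy metric, a model only beats the constant baseline if its estimate is closer to the true percentage than a blind 50% guess, which is the sense in which aggregation tasks remain an open challenge.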
