Datasets: A Community Library for Natural Language Processing

The scale, variety, and quantity of publicly available NLP datasets have grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at https://github.com/huggingface/datasets.
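As a rough sketch of the standardized end-user interface described above, the snippet below loads a small benchmark dataset and then streams a much larger corpus through the same entry point. The specific dataset names are illustrative, and exact call signatures may vary across library versions.

```python
from datasets import load_dataset

# Load a small dataset; examples are cached locally and memory-mapped,
# so the dataset does not need to fit entirely in RAM.
squad = load_dataset("squad", split="train")
print(squad[0])

# The same interface is meant to scale to internet-scale corpora:
# streaming iterates over examples without downloading the whole dataset.
oscar = load_dataset(
    "oscar", "unshuffled_deduplicated_en", split="train", streaming=True
)
print(next(iter(oscar)))
```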
