Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements

Evaluation is a key part of machine learning (ML), yet there is a lack of support and tooling to enable its informed and systematic practice. We introduce Evaluate and Evaluation on the Hub, a set of tools to facilitate the evaluation of models and datasets in ML. Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models. Its goal is to support reproducibility of evaluation, centralize and document the evaluation process, and broaden evaluation to cover more facets of model performance. It includes over 50 efficient canonical implementations for a variety of domains and scenarios, interactive documentation, and the ability to easily share implementations and outcomes. The library is available at https://github.com/huggingface/evaluate. In addition, we introduce Evaluation on the Hub, a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets on the Hugging Face Hub, for free, at the click of a button. Evaluation on the Hub is available at https://huggingface.co/autoevaluate.
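As a minimal sketch of the workflow the library is built around, the snippet below loads one of the canonical metric implementations and computes it over a small batch of predictions; the specific metric ("accuracy") and the toy inputs are illustrative choices, not taken from the paper.

```python
# Minimal sketch of computing a metric with the Evaluate library.
# Assumes: pip install evaluate (the "accuracy" metric also pulls in scikit-learn).
import evaluate

# Load a canonical metric implementation by name.
accuracy = evaluate.load("accuracy")

# Compute the metric over a batch of predictions and references.
results = accuracy.compute(
    predictions=[0, 1, 1, 0],
    references=[0, 1, 0, 0],
)
print(results)  # {'accuracy': 0.75}
```

The same load/compute pattern applies to other measurements and comparisons in the library, which is what allows evaluation code to be centralized, documented, and shared rather than re-implemented per project.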
