Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements
Alexander M. Rush | Leandro von Werra | Margaret Mitchell | Aleksandra Piktus | Julien Chaumond | Nazneen Rajani | Tristan Thrush | Thomas Wolf | A. Thakur | Douwe Kiela | A. Luccioni | Quentin Lhoest | Lewis Tunstall | Mario Šaško | Omar Sanseviero | Felix Marty | Victor Mustar | Helen Ngo | Albert Villanova