DataPerf: Benchmarks for Data-Centric AI Development

Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
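
To make concrete what "iterating on datasets instead of architectures" looks like in practice, the sketch below shows a minimal data-selection evaluation loop in the spirit of the vision and speech selection benchmarks: the model architecture is held fixed, and a submission is scored solely on the training subset it selects. This is an illustrative sketch under stated assumptions, not the DataPerf API; the function names and synthetic data are hypothetical stand-ins.

    # Minimal sketch of a fixed-model, data-selection evaluation loop.
    # Hypothetical stand-in for a DataPerf-style selection task (not the real API):
    # submissions differ only in which training examples they choose.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    def evaluate_selection(selected_idx, X_train, y_train, X_test, y_test):
        """Score a submitted set of training indices with a fixed model."""
        model = LogisticRegression(max_iter=1000)  # same architecture for every submission
        model.fit(X_train[selected_idx], y_train[selected_idx])
        return accuracy_score(y_test, model.predict(X_test))

    # Synthetic data standing in for a benchmark task.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)
    X_tr, y_tr, X_te, y_te = X[:800], y[:800], X[800:], y[800:]

    # A submission that selects 200 examples competes against the full training set.
    full = evaluate_selection(np.arange(800), X_tr, y_tr, X_te, y_te)
    subset = evaluate_selection(rng.choice(800, size=200, replace=False),
                                X_tr, y_tr, X_te, y_te)
    print(f"full training set: {full:.3f}  |  200-example submission: {subset:.3f}")

Under this framing, leaderboard progress comes from better example selection, labeling, or debugging rather than from changing the model, which is the iteration loop the online challenge platform is meant to support.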
