A Unified Framework for Task-Driven Data Quality Management

High-quality data is critical for training performant Machine Learning (ML) models, which highlights the importance of Data Quality Management (DQM). Existing DQM schemes often cannot satisfactorily improve ML performance because, by design, they are oblivious to the downstream ML task. Moreover, they cannot handle a wide range of data quality issues (especially those caused by adversarial attacks) and apply only to certain types of ML models. Recently, data valuation approaches (e.g., based on the Shapley value) have been leveraged to perform DQM; yet, empirical studies have observed that their performance varies considerably with the underlying data and training process. In this paper, we propose a task-driven, multi-purpose, model-agnostic DQM framework, DATASIFTER, which is optimized for a given downstream ML task, effectively removes data points with various defects, and is applicable to diverse models. Specifically, we formulate DQM as an optimization problem and devise a scalable algorithm to solve it. Furthermore, we propose a theoretical framework for comparing the worst-case performance of different DQM strategies. Remarkably, our results show that the popular strategy based on the Shapley value may end up choosing the worst data subset in certain practical scenarios. Our evaluation shows that DATASIFTER matches, and in most cases significantly improves on, state-of-the-art performance across a wide range of DQM tasks, including backdoor, poisoned, and noisy/mislabeled data detection, data summarization, and data debiasing.
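
For readers unfamiliar with the Shapley-value baseline discussed above, the sketch below illustrates one common way it is instantiated for DQM: estimate each training point's Data Shapley value by Monte Carlo permutation sampling of marginal contributions, then flag the lowest-valued points as likely defective. This is a minimal illustrative sketch, not DATASIFTER itself; the data arrays and the `fit_and_score` helper are hypothetical placeholders, and practical accelerations (e.g., truncation of permutations) are omitted.

```python
# Minimal sketch (assumed setup, not the paper's algorithm): Monte Carlo
# permutation-sampling estimate of Data Shapley values, used to drop the
# lowest-valued training points as a simple DQM baseline.
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_and_score(idx, X_tr, y_tr, X_val, y_val):
    """Validation accuracy of a model trained on the subset `idx` (0 if degenerate)."""
    if len(idx) == 0 or len(np.unique(y_tr[idx])) < 2:
        return 0.0
    model = LogisticRegression(max_iter=200).fit(X_tr[idx], y_tr[idx])
    return model.score(X_val, y_val)


def monte_carlo_shapley(X_tr, y_tr, X_val, y_val, n_perms=50, seed=0):
    """Average each point's marginal contribution over random permutations."""
    rng = np.random.default_rng(seed)
    n = len(y_tr)
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev_score = 0.0
        for k in range(1, n + 1):
            score = fit_and_score(perm[:k], X_tr, y_tr, X_val, y_val)
            values[perm[k - 1]] += score - prev_score  # marginal contribution
            prev_score = score
    return values / n_perms


# Usage (hypothetical arrays): flag the 10% lowest-valued points as suspect,
# e.g., candidates for removal before retraining.
# values = monte_carlo_shapley(train_X, train_y, val_X, val_y)
# suspect = np.argsort(values)[: int(0.1 * len(values))]
```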
