Complaint-driven Training Data Debugging for Query 2.0

As the need for machine learning (ML) increases rapidly across all industry sectors, there is significant interest among commercial database providers in supporting "Query 2.0", which integrates model inference into SQL queries. Debugging Query 2.0 is challenging because an unexpected query result may be caused by bugs in the training data (e.g., wrong labels or corrupted features). In response, we propose Rain, a complaint-driven training data debugging system. Rain allows users to specify complaints over a query's intermediate or final output, and aims to return a minimum set of training examples whose removal would resolve the complaints. To the best of our knowledge, we are the first to study this problem. A naive solution requires retraining an exponential number of ML models. We propose two novel heuristic approaches based on influence functions, both of which require only a linear number of retraining steps. We provide an in-depth analytical and empirical analysis of the two approaches and conduct extensive experiments on four real-world datasets to evaluate their effectiveness. Results show that Rain achieves the highest recall@k among all the baselines while still returning results interactively.
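
To make the influence-function idea concrete, the following is a minimal sketch and not Rain's actual algorithm. It assumes an L2-regularized logistic regression model, a query of the form "count the rows predicted positive", and a complaint that this count is too high. The count is relaxed to a sum of predicted probabilities so that it is differentiable, and each training example is scored by a first-order estimate of how much removing it would lower that relaxed count. All names (`train_logreg`, `removal_scores`, `X_query`, the synthetic data) are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_logreg(X, y, lam=1e-2, iters=25):
    """Fit L2-regularized logistic regression with Newton's method."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / n + lam * theta
        hess = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)
        theta -= np.linalg.solve(hess, grad)
    return theta

def removal_scores(X, y, theta, X_query, lam=1e-2):
    """Score each training example by the estimated drop in the relaxed
    query result q(theta) = sum(sigmoid(X_query @ theta)) if it were removed."""
    n, d = X.shape
    p = sigmoid(X @ theta)
    hess = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)
    # Gradient of the relaxed count with respect to the model parameters.
    pq = sigmoid(X_query @ theta)
    grad_q = X_query.T @ (pq * (1 - pq))
    v = np.linalg.solve(hess, grad_q)          # H^{-1} grad_q
    per_example_grads = (p - y)[:, None] * X   # loss gradient of each example
    # Removing example z shifts theta by about (1/n) H^{-1} grad L(z),
    # so the query result changes by about (1/n) grad L(z) . v.
    delta_q = per_example_grads @ v / n
    return -delta_q                            # larger = removal lowers q more

# Synthetic data (assumption): 20 negative labels are flipped to positive,
# which inflates the number of rows the model predicts as positive.
rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])
true_theta = np.array([2.0, -1.0, 0.5, 0.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=n)).astype(float)
corrupted = rng.choice(np.flatnonzero(y == 0), size=20, replace=False)
y_dirty = y.copy()
y_dirty[corrupted] = 1.0

X_query = np.hstack([rng.normal(size=(100, d)), np.ones((100, 1))])
theta = train_logreg(X, y_dirty)
scores = removal_scores(X, y_dirty, theta, X_query)
top_k = np.argsort(-scores)[:20]               # candidate examples to inspect
recall_at_k = len(set(top_k) & set(corrupted)) / len(corrupted)
print("recall@20 on this synthetic example:", recall_at_k)
```

The sketch scores every example from a single trained model. An iterative variant would remove the top-ranked examples, retrain, and rescore, which is one way to read the "linear number of retraining steps" mentioned above; how Rain actually schedules retraining is not specified in this abstract.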
