Explaining Inference Queries with Bayesian Optimization

Obtaining an explanation for an SQL query result can enrich the analysis experience, reveal data errors, and provide deeper insight into the data. Inference query explanation seeks to explain unexpected aggregate query results on inference data. Such queries are challenging to explain because an explanation may need to be derived from the source, training, or inference data in an ML pipeline. In this paper, we treat the objective function as a black box and propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). An explanation is a predicate that defines the input tuples whose removal significantly affects the query result of interest. BO, a technique for finding the global optimum of a black-box function, is used to find the best predicate. We develop two new techniques (individual contribution encoding and warm start) to handle categorical variables. Our experiments show that the predicates found by BOExplain have a higher degree of explanation than those found by state-of-the-art query explanation engines. We also show that BOExplain is effective at deriving explanations for inference queries from source and training data on a variety of real-world datasets. BOExplain is open-sourced as a Python package at https://github.com/sfu-db/BOExplain.
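As a rough illustration of the black-box formulation described above, the following Python sketch scores a range predicate by removing the tuples it selects and recomputing an aggregate. Everything in it is an assumption made for illustration: the toy table `ROWS`, the helper names `avg_amount`, `objective`, and `random_search`, and the choice to minimize the aggregate. It does not use the BOExplain package, and plain random search stands in for Bayesian optimization so that the example needs only the standard library.

```python
import random

# Hypothetical toy table of (age, amount) tuples; not taken from the paper.
# Ages 60..69 carry unusually large amounts and inflate the average.
ROWS = [(age, 10.0) for age in range(20, 60)] + \
       [(age, 500.0) for age in range(60, 70)]


def avg_amount(rows):
    """Aggregate query of interest: AVG(amount) over the given rows."""
    return sum(amount for _, amount in rows) / len(rows)


def objective(lo, hi):
    """Black-box objective: AVG(amount) after removing tuples with lo <= age <= hi.

    A predicate that removes every tuple is treated as invalid (infinite cost).
    """
    kept = [row for row in ROWS if not (lo <= row[0] <= hi)]
    if not kept:
        return float("inf")
    return avg_amount(kept)


def random_search(n_iter=200, seed=0):
    """Stand-in optimizer (NOT Bayesian optimization): sample predicates
    uniformly at random and keep the one with the lowest objective value."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        lo = rng.randint(20, 69)
        hi = rng.randint(lo, 69)
        score = objective(lo, hi)
        if best is None or score < best[0]:
            best = (score, lo, hi)
    return best


if __name__ == "__main__":
    print("baseline AVG(amount):", avg_amount(ROWS))
    score, lo, hi = random_search()
    print(f"best predicate found: {lo} <= age <= {hi}, AVG(amount) = {score}")
```

A BO-based search would replace `random_search` with a surrogate-model-guided sampler that proposes the next `(lo, hi)` pair based on the scores observed so far. The objective itself stays a black box in either case.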
