Explanations for Data Repair Through Shapley Values

Data repair, i.e., the identification and correction of errors in data, is a central component of the Data Science cycle. As such, significant research effort has been devoted to automating the repair process. Yet repair still requires significant manual labor from data scientists, who tweak and optimize repair modules (up to 80% of their time, according to surveys). To ease this effort, we propose a novel framework for explaining the results of any data repair module. An explanation identifies the table cells and database constraints that have the strongest influence on the repair process. Influence, in turn, is quantified through the game-theoretic notion of Shapley values, commonly used for explaining the results of Machine Learning classifiers. The main technical challenge is that exact computation of Shapley values takes exponential time. We consequently devise and optimize novel approximation algorithms, and analyze them both theoretically and empirically. Our results show the efficiency of our approach when compared to the alternative of adapting existing Shapley value computation techniques to the data repair setting.
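To make the cost of exact computation concrete, the following is a minimal sketch that contrasts exact Shapley values, obtained by enumerating all coalitions, with the standard permutation-sampling estimator. It is not the approximation algorithm proposed in the paper. The player names (`c1`, `c2`, `t1`, `t2`) and the function `toy_repair_value` are hypothetical: they stand in for constraints, table cells, and a black-box repair module that reports whether a target cell is repaired when only a subset of the inputs is available.

```python
import itertools
import math
import random


def exact_shapley(players, value):
    """Exact Shapley values by enumerating every coalition.

    The number of calls to `value` grows exponentially with len(players).
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for coalition in itertools.combinations(others, k):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def sampled_shapley(players, value, num_samples, seed=0):
    """Monte Carlo estimate: average marginal contributions over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(num_samples):
        rng.shuffle(order)
        coalition = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            totals[p] += current - previous
            previous = current
    return {p: total / num_samples for p, total in totals.items()}


def toy_repair_value(coalition):
    """Hypothetical black box: the target cell is repaired (value 1) only if
    constraint c1 is present together with at least one witness cell t1 or t2.
    Constraint c2 never affects the outcome."""
    has_witness = "t1" in coalition or "t2" in coalition
    return 1.0 if "c1" in coalition and has_witness else 0.0


PLAYERS = ["c1", "c2", "t1", "t2"]

if __name__ == "__main__":
    print("exact:  ", exact_shapley(PLAYERS, toy_repair_value))
    print("sampled:", sampled_shapley(PLAYERS, toy_repair_value, num_samples=2000))
```

In this toy game the constraint `c1` receives the largest share (2/3), the two interchangeable witness cells split the remainder (1/6 each), and the irrelevant constraint `c2` receives zero. The sampled estimator calls the black box once per player per ordering, so its cost is controlled by the number of samples rather than by the number of coalitions.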
