Data Quality and Explainable AI

In this work, we offer some insights and develop some ideas, with few technical details, about the role of explanations in data quality in the context of data-based machine learning (ML) models. In this direction there are, as expected, roles for causality and explainable artificial intelligence. The latter area sheds light not only on the models, but also on the data that support model construction. There is also room for defining, identifying, and explaining errors in data, in particular in ML, and for suggesting repair actions. More generally, explanations can be used as a basis for defining dirty data in the context of ML, and for measuring or quantifying it. We regard dirtiness as relative to the ML task at hand, e.g., classification.
