Bias in OLAP Queries: Detection, Explanation, and Removal

Online analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide the insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose a system to detect, explain, and resolve bias in decision-support queries. We give a simple definition of a biased query, under which bias is detected by performing a set of independence tests on the data. We propose a novel technique that gives explanations for bias, thus helping an analyst understand what is going on. Additionally, we develop an automated method for rewriting a biased query into an unbiased query that shows what the analyst intended to examine. In a thorough evaluation on several real datasets, we show both the quality and the performance of our techniques, including the completely automatic discovery of the revolutionary insights from a famous 1973 discrimination case.
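To make the notion of a biased group-by answer concrete, the following is a minimal Python sketch, not the system described in the paper. It uses hypothetical toy counts for an admissions-style scenario and compares a plain per-group rate with a rate adjusted for a single covariate (department), weighting each department by its overall share of applicants. The names `COUNTS`, `naive_rates`, and `adjusted_rates` are illustrative assumptions, and this covariate adjustment is only one common way to correct such a query; it does not perform the independence tests the abstract mentions.

```python
from collections import defaultdict

# Hypothetical toy counts (invented for illustration, not taken from the paper):
# (group, department) -> (applicants, admitted)
COUNTS = {
    ("men", "A"): (80, 48),
    ("women", "A"): (20, 14),
    ("men", "B"): (20, 4),
    ("women", "B"): (80, 24),
}


def naive_rates(counts):
    """Admission rate per group, ignoring department (like a plain GROUP BY)."""
    applied = defaultdict(int)
    admitted = defaultdict(int)
    for (group, _dept), (n, k) in counts.items():
        applied[group] += n
        admitted[group] += k
    return {g: admitted[g] / applied[g] for g in applied}


def adjusted_rates(counts):
    """Per-department rates, averaged using overall department shares as weights."""
    dept_total = defaultdict(int)
    for (_group, dept), (n, _k) in counts.items():
        dept_total[dept] += n
    total = sum(dept_total.values())

    rates = defaultdict(float)
    for (group, dept), (n, k) in counts.items():
        rates[group] += (dept_total[dept] / total) * (k / n)
    return dict(rates)


naive = naive_rates(COUNTS)
adjusted = adjusted_rates(COUNTS)
print("naive:   ", naive)
print("adjusted:", adjusted)
```

With these toy counts, the naive rates are about 0.52 for men and 0.38 for women, while the adjusted rates are about 0.40 and 0.50. The direction of the comparison reverses once department is taken into account, which is the kind of misleading answer the abstract calls biased.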
