Paper Information - Tackling Simpson's Paradox in Big Data using Classification & Regression Trees

This work aims to find potential Simpson’s paradoxes in Big Data. Simpson’s paradox (SP) arises when choosing the level of data aggregation for causal inference. It describes the phenomenon in which the direction of the effect of a cause is reversed when examining the aggregate vs. the disaggregates of a sample or population. The practical decision-making dilemma that SP raises is which level of data aggregation presents the right answer. We propose a tree-based approach for detecting SP in data. Classification and regression trees are popular predictive algorithms that capture relationships between an outcome and a set of inputs. They are used for record-level predictions and for variable selection. We introduce a novel usage for a cause-and-effect scenario with potential confounding variables. A tree is used to capture the relationship between the effect and the set consisting of the cause and the potential confounders. We show that the tree structure determines whether a paradox is possible. The resulting tree graphically displays potential confounders and the confounding direction, allowing researchers or decision makers to identify potential SPs to be further investigated with a causal toolkit. We illustrate our SP detection approach using real data, for both a single confounder and multiple confounders, in a large dataset on kidney transplant waiting time.
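The reversal described in the abstract can be made concrete with a small numeric sketch. The counts below are hypothetical, invented purely for illustration, and are not taken from the paper or its kidney transplant data. The names `counts`, `effect`, and `reversal` are likewise assumptions. The sketch checks the definition of SP directly by comparing the sign of the effect in the aggregate with its sign in each stratum of a single confounder; it is not the tree-based detection method the authors propose.

```python
from fractions import Fraction

# Hypothetical counts, invented for illustration (not from the paper).
# Key: (confounder stratum, received the cause?) -> (successes, total)
counts = {
    ("low", True): (8, 10),
    ("low", False): (70, 100),
    ("high", True): (40, 100),
    ("high", False): (3, 10),
}


def rate(cells):
    """Pooled success rate over a list of (successes, total) cells."""
    successes = sum(s for s, _ in cells)
    total = sum(n for _, n in cells)
    return Fraction(successes, total)


def effect(counts, stratum=None):
    """Success-rate difference (cause present minus cause absent).

    With stratum=None the data are aggregated over all strata;
    otherwise only the given stratum is used.
    """
    treated = [v for (z, t), v in counts.items() if t and stratum in (None, z)]
    control = [v for (z, t), v in counts.items() if not t and stratum in (None, z)]
    return rate(treated) - rate(control)


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


aggregate = effect(counts)
strata = sorted({z for z, _ in counts})
by_stratum = {z: effect(counts, z) for z in strata}

# A full reversal: every stratum shows a nonzero effect whose direction
# is opposite to the direction seen in the aggregate.
reversal = sign(aggregate) != 0 and all(
    sign(d) == -sign(aggregate) for d in by_stratum.values()
)

print("aggregate effect:", float(aggregate))
for z, d in by_stratum.items():
    print(f"effect within stratum {z!r}:", float(d))
print("Simpson's paradox (full reversal):", reversal)
```

In this constructed example the cause looks harmful in the aggregate yet beneficial within each stratum, because the cause is applied mostly in the stratum with the lower baseline success rate. Which level of aggregation gives the right answer is the causal question the abstract raises, and this check alone cannot settle it.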

Galit Shmueli | Inbal Yahav
