DoGR: Disaggregated Gaussian Regression for Reproducible Analysis of Heterogeneous Data

Quantitative analysis of large-scale data is often complicated by the presence of diverse subgroups, which reduce the accuracy of inferences that models make on held-out data. To address the challenge of heterogeneous data analysis, we introduce DoGR, a method that discovers latent confounders by simultaneously partitioning the data into overlapping clusters (disaggregation) and modeling the behavior within them (regression). When applied to real-world data, our method discovers meaningful clusters and their characteristic behaviors, thus giving insight into group differences and their impact on the outcome of interest. By accounting for latent confounders, our framework facilitates exploratory analysis of noisy, heterogeneous data and can be used to learn predictive models that generalize better to new data. We provide the code to enable others to use DoGR within their data analytic workflows.
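To make the idea of joint disaggregation and regression concrete, the following is a minimal sketch of a soft clusterwise regression fitted with expectation-maximization. It is not the DoGR implementation: the function name `fit_soft_regressions`, the diagonal Gaussian over the features, the quantile-based initialization, and the synthetic data are all assumptions made for illustration. Each point receives a membership weight in every component, so clusters may overlap, and each component fits its own linear model.

```python
import numpy as np

def fit_soft_regressions(X, y, k=2, n_iter=50):
    """EM for a k-component mixture (illustrative, not the authors' code).

    Each component has a diagonal Gaussian over X and its own linear
    regression of y on X. Returns per-component coefficients
    (slopes followed by intercept) and soft memberships.
    """
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])  # append an intercept column
    # Assumed initialization: soft quantile bins along the first feature.
    ranks = np.argsort(np.argsort(X[:, 0]))
    bins = (ranks * k) // n
    resp = np.full((n, k), 0.1)
    resp[np.arange(n), bins] = 0.9
    resp /= resp.sum(axis=1, keepdims=True)

    betas = np.empty((k, d + 1))
    for _ in range(n_iter):
        log_p = np.empty((n, k))
        for j in range(k):
            r = resp[:, j]
            rs = r.sum() + 1e-12
            # M-step: weighted mean/variance of X and weighted least squares.
            mu = (r @ X) / rs
            var = (r @ (X - mu) ** 2) / rs + 1e-6
            W = Xb * r[:, None]
            betas[j] = np.linalg.solve(Xb.T @ W + 1e-8 * np.eye(d + 1), W.T @ y)
            resid = y - Xb @ betas[j]
            s2 = (r @ resid ** 2) / rs + 1e-6
            # Log-density of each point under component j (used in the E-step).
            log_px = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
            log_py = -0.5 * (np.log(2 * np.pi * s2) + resid ** 2 / s2)
            log_p[:, j] = np.log(rs / n) + log_px + log_py
        # E-step: normalize to obtain soft (overlapping) memberships.
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return betas, resp

# Synthetic data with two hidden subgroups: the trend is negative inside
# each group, but the pooled trend is positive.
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 400)
y1 = -x1 + rng.normal(0.0, 0.3, 400)
x2 = rng.normal(5.0, 1.0, 400)
y2 = -x2 + 10.0 + rng.normal(0.0, 0.3, 400)
X = np.concatenate([x1, x2])[:, None]
y = np.concatenate([y1, y2])

pooled_slope = np.polyfit(X[:, 0], y, 1)[0]
betas, resp = fit_soft_regressions(X, y, k=2)
print("pooled slope:", pooled_slope)
print("per-component slopes:", betas[:, 0])
```

In this toy setting the pooled regression and the per-component regressions disagree in sign, which is the kind of subgroup effect that disaggregation is meant to expose. The number of components `k` is fixed by hand here; choosing it from data is outside the scope of this sketch.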
