A causal view on compositional data

Many scientific datasets are compositional in nature. Important examples include species abundances in ecology, rock compositions in geology, topic compositions in large-scale text corpora, and sequencing count data in molecular biology. Here, we provide a causal view on compositional data in an instrumental variable setting where the composition acts as the cause. Throughout, we pay particular attention to the interpretation of compositional causes from the viewpoint of interventions and crisply articulate potential pitfalls for practitioners. Focusing on modern high-dimensional microbiome sequencing data as a timely illustrative use case, our analysis first reveals that popular one-dimensional information-theoretic summary statistics, such as diversity and richness, may be insufficient for drawing causal conclusions from ecological data. Instead, we advocate for multivariate alternatives using statistical data transformations and regression techniques that take the special structure of the compositional sample space into account. In a comparative analysis on synthetic and semi-synthetic data, we show the advantages and limitations of our proposal. We posit that our framework may provide a useful starting point for cause-effect estimation in the context of compositional data.
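To make the point about one-dimensional summaries concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes strictly positive toy compositions (no zero counts), uses Shannon entropy as one example of a diversity summary, and uses the centered log-ratio (clr) transform as one example of a multivariate transformation; the helper names `closure`, `shannon_diversity`, and `clr` are illustrative.

```python
import numpy as np


def closure(x):
    """Rescale non-negative parts so that they sum to one."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)


def shannon_diversity(x):
    """Shannon entropy of a composition (assumes all parts are > 0)."""
    p = closure(x)
    return -np.sum(p * np.log(p), axis=-1)


def clr(x):
    """Centered log-ratio transform (assumes all parts are > 0)."""
    log_p = np.log(closure(x))
    return log_p - log_p.mean(axis=-1, keepdims=True)


# Two hypothetical three-taxon compositions: b is a permutation of a,
# so the dominant taxon differs between the two samples.
a = np.array([0.7, 0.2, 0.1])
b = np.array([0.1, 0.2, 0.7])

# The scalar summary cannot tell the two samples apart ...
same_diversity = bool(np.isclose(shannon_diversity(a), shannon_diversity(b)))

# ... whereas the multivariate log-ratio coordinates can.
same_clr = bool(np.allclose(clr(a), clr(b)))

print(same_diversity, same_clr)
```

In this toy case the two samples share the same diversity value although different taxa dominate, so any effect that depends on which taxon is abundant is invisible to the scalar summary. Real sequencing data contain zeros, which would need separate handling (for example, a pseudocount) before taking logarithms.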
