False discovery rate control incorporating phylogenetic tree increases detection power in microbiome‐wide multiple testing

Motivation: Next-generation sequencing technologies have enabled the study of the human microbiome through direct sequencing of microbial DNA, producing an enormous amount of microbiome sequencing data. One unique characteristic of microbiome data is the phylogenetic tree that relates all the bacterial species. Closely related bacterial species tend to exhibit a similar relationship with the environment or disease. Incorporating the phylogenetic tree can therefore improve detection power in microbiome-wide association studies, where hundreds or thousands of tests are conducted simultaneously to identify bacterial species associated with a phenotype of interest. Despite much progress in multiple testing procedures such as false discovery rate (FDR) control, few methods take the phylogenetic tree into account.

Results: We propose a new FDR control procedure that incorporates prior structure information, and we apply it to microbiome data. The procedure is based on a hierarchical model in which a structure-based prior distribution is designed to exploit the phylogenetic tree. By borrowing information from neighboring bacterial species, we improve the statistical power to detect associated bacterial species while controlling the FDR at the desired level. When the phylogenetic tree is mis-specified or non-informative, our procedure achieves power similar to that of traditional procedures that ignore the tree structure. We demonstrate the performance of our method through extensive simulations and on real microbiome datasets, where it identified far more bacterial species associated with alcohol drinking than traditional methods did.

Availability and implementation: The R package StructFDR is available from CRAN.

Contact: chen.jun2@mayo.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
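For orientation, the kind of "traditional procedure" that ignores the tree can be illustrated with the Benjamini-Hochberg step-up rule. The sketch below is not the proposed tree-based method and is not taken from the StructFDR package; the function name `benjamini_hochberg` and the example p-values are hypothetical and chosen only for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up rule (tree-agnostic baseline, illustration only).

    Returns a list of booleans, True where the hypothesis is rejected
    at nominal FDR level alpha.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the hypotheses ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank

    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical per-species p-values (made up for this example).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(sum(rejected), "of", len(p_values), "hypotheses rejected")
```

A rule of this kind treats every species independently of its position in the tree. The procedure described in the abstract differs in that it borrows information from phylogenetically neighboring species before the FDR is controlled.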
