Testing for dependence on tree structures

Significance

Tree-like structures are abundant in the empirical sciences, as they can summarize high-dimensional data and show latent structure among many samples in a single framework. Prominent examples include phylogenetic trees and hierarchical clusterings derived from genetic data. Currently, users employ ad hoc methods to test for association between a given tree and a response variable, which reduces reproducibility and robustness. In this paper, we introduce treeSeg, a simple-to-use and widely applicable methodology with high power for testing for association between the response and all levels of hierarchy of a given tree while accounting for the overall false positive rate. Our method allows for precise uncertainty quantification and therefore increases the interpretability and reproducibility of such studies across many fields of science.

Tree structures, showing hierarchical relationships and the latent structures between samples, are ubiquitous in genomic and biomedical sciences. A common question in many studies is whether there is an association between a response variable measured on each sample and the latent group structure represented by some given tree. Currently, this is addressed on an ad hoc basis, usually requiring the user to decide on an appropriate number of clusters to prune out of the tree to be tested against the response variable. Here, we present a method with statistical guarantees that tests for association between the response variable and a fixed tree structure across all levels of the tree hierarchy with high power while accounting for the overall false positive error rate. This enhances the robustness and reproducibility of such findings.
