Fast computation of genome-metagenome interaction effects

Motivation: Association studies have been widely used to search for associations between common genetic variants and a given phenotype. However, it is now generally accepted that genes and environment must be examined jointly when estimating phenotypic variance. In this work we consider two types of biological markers: genotypic markers, which characterize an observation in terms of inherited genetic information, and metagenomic markers, which are related to the environment. Both types of markers are available in their millions and can be used to characterize any observation uniquely.

Objective: Our focus is on detecting interactions between groups of genetic and metagenomic markers, in order to better understand the complex relationship between environment and genome in the expression of a given phenotype.

Contributions: We propose a novel approach for detecting interactions between complementary datasets in a high-dimensional setting at reduced computational cost. The method, named SICOMORE, reduces the dimension of the search space by selecting a subset of supervariables in each of the two complementary datasets. These supervariables are given by a weighted group structure defined on sets of variables at different scales. A Lasso selection is then applied to each type of supervariable to obtain a subset of potential interactions, which are then explored via linear model testing.

Results: We compare SICOMORE with other approaches in simulations with varying sample sizes, noise levels, and numbers of true interactions. SICOMORE achieves good recall and competitive running times. The method is also used to detect interactions between genomic markers in Medicago truncatula and metagenomic markers in its rhizosphere bacterial community.

Software availability: An R package is available [4], along with its documentation and associated scripts, allowing the reader to reproduce the results presented in the paper.
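The three steps named in the abstract (group compression into supervariables, Lasso screening within each dataset, linear-model testing of the retained pairs) can be illustrated with a minimal sketch. It is not the SICOMORE implementation or the R package's API. All function names are hypothetical, the groups are assumed to be given at a single scale and averaged without weights, and the Lasso is a plain coordinate-descent solver written with the Python standard library only.

```python
import math
from statistics import fmean


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(v):
    mu = fmean(v)
    return [x - mu for x in v]


def supervariables(rows, groups):
    """One column per group: per-observation mean of the group's variables.
    Assumption: unweighted mean over groups fixed in advance."""
    return [[fmean([row[j] for j in g]) for row in rows] for g in groups]


def lasso_select(cols, y, lam, n_iter=200):
    """Coordinate-descent Lasso on centered columns, minimising
    (1/2n)||y - Xb||^2 + lam*||b||_1; returns indices of nonzero weights."""
    n = len(y)
    cols = [center(c) for c in cols]
    resid = center(y)
    beta = [0.0] * len(cols)
    for _ in range(n_iter):
        for j, c in enumerate(cols):
            scale = dot(c, c) / n
            if scale == 0.0:  # a constant column carries no signal
                continue
            rho = dot(c, resid) / n + beta[j] * scale
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / scale
            resid = [r - (new - beta[j]) * x for r, x in zip(resid, c)]
            beta[j] = new
    return [j for j, b in enumerate(beta) if b != 0.0]


def interaction_t(g, m, y):
    """t statistic of the g*m coefficient in y ~ 1 + g + m + g*m.
    Assumes the four columns are linearly independent and the fit is not
    exact; a p-value would use a Student t with n - 4 degrees of freedom."""
    n = len(y)
    basis = []
    for col in ([1.0] * n, g, m, [a * b for a, b in zip(g, m)]):
        q = list(col)
        for b in basis:  # Gram-Schmidt: remove what earlier columns explain
            coef = dot(q, b) / dot(b, b)
            q = [x - coef * z for x, z in zip(q, b)]
        basis.append(q)
    resid = list(y)
    for b in basis:  # residual of y after the full four-column fit
        coef = dot(resid, b) / dot(b, b)
        resid = [r - coef * z for r, z in zip(resid, b)]
    z = basis[-1]  # interaction column orthogonal to 1, g and m
    beta = dot(z, y) / dot(z, z)
    sigma2 = dot(resid, resid) / (n - 4)
    return beta / math.sqrt(sigma2 / dot(z, z))


def screen_interactions(G, M, y, groups_g, groups_m, lam):
    """Compress, screen each dataset separately, then test retained pairs."""
    sg = supervariables(G, groups_g)
    sm = supervariables(M, groups_m)
    keep_g = lasso_select(sg, y, lam)
    keep_m = lasso_select(sm, y, lam)
    return [(i, j, interaction_t(sg[i], sm[j], y))
            for i in keep_g for j in keep_m]
```

Here `G` and `M` are lists of per-observation rows (genomic and metagenomic markers), and `groups_g` and `groups_m` list the column indices of each group. The point of the design is that only pairs of supervariables surviving both screens reach the interaction test, so the number of linear models fitted is the product of the two selected subset sizes rather than of the full marker counts.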

[1]  Qiang Feng,et al.  A metagenome-wide association study of gut microbiota in type 2 diabetes , 2012, Nature.

[2]  Lana X. Garmire,et al.  More Is Better: Recent Progress in Multi-Omics Data Integration Methods , 2017, Front. Genet.

[3]  R. Knight,et al.  UniFrac: a New Phylogenetic Method for Comparing Microbial Communities , 2005, Applied and Environmental Microbiology.

[4]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[5]  M. Knörnschild,et al.  Corrigendum: Bats host major mammalian paramyxoviruses , 2014, Nature Communications.

[6]  Dan Knights,et al.  Complex host genetics influence the microbiome in inflammatory bowel disease , 2014, Genome Medicine.

[7]  N. Meinshausen,et al.  Stability selection , 2008, 0809.2932.

[8]  Benjamin Hofner,et al.  Controlling false discoveries in high-dimensional situations: boosting with stability selection , 2014, BMC Bioinformatics.

[9]  Bjarni J. Vilhjálmsson,et al.  Genome-wide association study of Arabidopsis thaliana's leaf microbial community , 2014, Nature Communications.

[10]  B. Lugtenberg,et al.  Plant-growth-promoting rhizobacteria. , 2009, Annual review of microbiology.

[11]  Alioune Ngom,et al.  A review on machine learning principles for multi-view biological data integration , 2016, Briefings Bioinform.

[12]  J. H. Ward Hierarchical Grouping to Optimize an Objective Function , 1963 .

[13]  Ruben Garrido-Oter,et al.  Interplay Between Innate Immunity and the Plant Microbiota. , 2017, Annual review of phytopathology.

[14]  Jun Wang,et al.  Metagenome-wide association studies: fine-mining the microbiome , 2016, Nature Reviews Microbiology.

[15]  Gregory B. Gloor,et al.  Compositional uncertainty should not be ignored in high-throughput sequencing data analysis , 2016 .

[16]  Li Yuan,et al.  PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acid composition , 2017 .

[17]  Robert Tibshirani,et al.  Estimating the number of clusters in a data set via the gap statistic , 2000 .

[18]  C. Pieterse,et al.  The rhizosphere microbiome and plant health. , 2012, Trends in plant science.

[19]  Roberto Pinton,et al.  The rhizosphere : biochemistry and organic substances at the soil-plant interface , 2007 .

[20]  Bernhard Y. Renard,et al.  Abundance estimation and differential testing on strain level in metagenomics data , 2017, Bioinform.

[21]  Tso-Jung Yen,et al.  Discussion on "Stability Selection" by Meinshausen and Bühlmann , 2010 .

[22]  William Underwood,et al.  The Plant Cell Wall: A Dynamic Barrier Against Pathogen Invasion , 2012, Front. Plant Sci.

[23]  Jacqueline Clavel,et al.  Progress in the epidemiological understanding of gene-environment interactions in major diseases: cancer. , 2007, Comptes rendus biologies.

[24]  Yongmei Cheng,et al.  A Comparison of Methods for Clustering 16S rRNA Sequences into OTUs , 2013, PloS one.

[25]  John Aitchison,et al.  The Statistical Analysis of Compositional Data , 1986 .

[26]  Christophe Ambroise,et al.  Performance of a blockwise approach in variable selection using linkage disequilibrium information , 2015, BMC Bioinformatics.

[27]  Y. She,et al.  Group Regularized Estimation Under Structural Hierarchy , 2014, 1411.4691.

[28]  German Spangenberg,et al.  Functional Analyses of Caffeic Acid O-Methyltransferase and Cinnamoyl-CoA-Reductase Genes from Perennial Ryegrass (Lolium perenne) , 2010, Plant Cell.

[29]  K. Pearson Mathematical contributions to the theory of evolution.—On a form of spurious correlation which may arise when indices are used in the measurement of organs , 1897, Proceedings of the Royal Society of London.

[30]  Eric P. Xing,et al.  Ensembles of Lasso Screening Rules , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Judy H. Cho,et al.  Finding the missing heritability of complex diseases , 2009, Nature.

[32]  D. Thomas,et al.  Gene–environment-wide association studies: emerging approaches , 2010, Nature Reviews Genetics.

[33]  Peter Kraft,et al.  Gene‐Environment Interactions in Cancer Epidemiology: A National Cancer Institute Think Tank Report , 2013, Genetic epidemiology.

[34]  K. G. Mukerji,et al.  Techniques in Mycorrhizal Studies , 2002, Springer Netherlands.

[35]  Hongzhe Li Microbiome, Metagenomics, and High-Dimensional Compositional Data Analysis , 2015 .

[36]  W. M. de Vos,et al.  Human Microbiota in Health and Disease , 2012 .

[37]  T. Hastie,et al.  Learning Interactions via Hierarchical Group-Lasso Regularization , 2015, Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America.

[38]  Peter Donnelly,et al.  HAPGEN2: simulation of multiple disease SNPs , 2011, Bioinform.

[39]  Xihong Lin,et al.  Test for interactions between a genetic marker set and environment in generalized linear models. , 2013, Biostatistics.

[40]  S. Hacquard,et al.  Microbial interactions within the plant holobiont , 2018, Microbiome.

[41]  Francis R. Bach,et al.  Bolasso: model consistent Lasso estimation through the bootstrap , 2008, ICML '08.

[42]  G. W. Milligan,et al.  An examination of procedures for determining the number of clusters in a data set , 1985 .

[43]  J. Goeman,et al.  Multiple Testing for Exploratory Research , 2011, 1208.2841.

[44]  Nilanjan Chatterjee,et al.  Review of Statistical Methods for Gene-Environment Interaction Analysis , 2018, Current Epidemiology Reports.

[45]  Christophe Ambroise,et al.  Learning the optimal scale for GWAS through hierarchical SNP aggregation , 2017, BMC Bioinformatics.

[46]  Susan A. Murphy,et al.  Monographs on statistics and applied probability , 1990 .

[47]  Stefan Bertilsson,et al.  Oral Microbiota Development in Early Childhood , 2019, Scientific Reports.

[48]  Christophe Ambroise,et al.  Eigen-Epistasis for detecting gene-gene interactions , 2016, BMC Bioinformatics.

[49]  Hilde van der Togt,et al.  Publisher's Note , 2003, J. Netw. Comput. Appl.

[50]  Mark Blaxter,et al.  Defining operational taxonomic units using DNA barcode data , 2005, Philosophical Transactions of the Royal Society B: Biological Sciences.

[51]  Joy Bergelson,et al.  Characterizing both bacteria and fungi improves understanding of the Arabidopsis root microbiome , 2019, Scientific Reports.

[52]  Torsten Hothorn,et al.  Stability Selection with Error Control , 2015 .

[53]  Ruth J. Muschel,et al.  Corrigendum: Cancer cells that survive radiation therapy acquire HIF-1 activity and translocate toward tumour blood vessels , 2013, Nature Communications.

[54]  Trevor Hastie,et al.  Averaged gene expressions for regression. , 2007, Biostatistics.

[55]  G. Srinivas,et al.  Genome-wide mapping of gene–microbiota interactions in susceptibility to autoimmune skin blistering , 2013, Nature Communications.

[56]  Fabian J. Theis,et al.  Inferring Interaction Networks From Multi-Omics Data , 2019, Front. Genet.

[57]  Yaakov Tsaig,et al.  Fast Solution of $\ell _{1}$ -Norm Minimization Problems When the Solution May Be Sparse , 2008, IEEE Transactions on Information Theory.

[58]  Jean M. Macklaim,et al.  Microbiome Datasets Are Compositional: And This Is Not Optional , 2017, Front. Microbiol.

[59]  R. Knight,et al.  PyCogent: a toolkit for making sense from sequence , 2007, Genome Biology.

[60]  Patrick Wincker,et al.  Molecular biomass and MetaTaxogenomic assessment of soil microbial communities as influenced by soil DNA extraction procedure , 2011, Microbial biotechnology.

[61]  A. Rau,et al.  Statistical methods and software for the analysis of transcriptomic data , 2017 .

[62]  Gary Stacey,et al.  Rhizobium-legume symbioses: the crucial role of plant immunity. , 2015, Trends in plant science.

[63]  Jean-Philippe Vert,et al.  Group lasso with overlap and graph lasso , 2009, ICML '09.

[64]  U. Nöthlings,et al.  Genome-wide association analysis identifies variation in vitamin D receptor and other host factors influencing the gut microbiota , 2016, Nature Genetics.

[65]  R. Tibshirani,et al.  Strong rules for discarding predictors in lasso‐type problems , 2010, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[66]  Joy Bergelson,et al.  Adaptation to Climate across the Arabidopsis thaliana Genome , 2022 .

[67]  J. Aitchison,et al.  The multivariate Poisson-log normal distribution , 1989 .

[68]  Antonio Bispo,et al.  Meta-barcoded evaluation of the ISO standard 11063 DNA extraction procedure to characterize soil bacterial and fungal community diversity and composition , 2014, Microbial biotechnology.

[69]  Quentin Grimonprez,et al.  Sélection de groupes de variables corrélées en grande dimension , 2016 .

[70]  David N. Fredricks,et al.  The Human Microbiota: How Microbial Communities Affect Health and Disease , 2013 .

[71]  C. Huttenhower,et al.  Metagenomic biomarker discovery and explanation , 2011, Genome Biology.

[72]  N. J. Brewin,et al.  Plant Cell Wall Remodelling in the Rhizobium–Legume Symbiosis , 2004 .

[73]  R. Tibshirani,et al.  A LASSO FOR HIERARCHICAL INTERACTIONS. , 2012, Annals of statistics.

[74]  Bertrand Thirion,et al.  Statistical Inference with Ensemble of Clustered Desparsified Lasso , 2018, MICCAI.

[75]  Jesse R. Zaneveld,et al.  Normalization and microbial differential abundance strategies depend upon data characteristics , 2017, Microbiome.

[76]  Paul J. McMurdie,et al.  Exact sequence variants should replace operational taxonomic units in marker-gene data analysis , 2017, The ISME Journal.