MetaLonDA: a flexible R package for identifying time intervals of differentially abundant features in metagenomic longitudinal studies

Background: Microbial longitudinal studies are powerful experimental designs used to classify diseases, determine prognosis, and analyze microbial systems dynamics. In longitudinal studies, identifying differential features between two phenotypes is not enough on its own to determine whether a change in relative abundance is short-term or continuous. Furthermore, sample collection in longitudinal studies suffers from several forms of variability, such as a different number of subjects per phenotypic group, a different number of samples per subject, and samples not collected at consistent time points. These inconsistencies are common in studies that collect samples from human subjects.

Results: We present MetaLonDA, an R package that identifies significant time intervals of differentially abundant microbial features. MetaLonDA is flexible in that it can perform differential abundance tests despite inconsistencies associated with sample collection. Extensive experiments on simulated datasets quantitatively demonstrate the effectiveness of MetaLonDA, with significant improvement over alternative methods. We applied MetaLonDA to the DIABIMMUNE cohort (https://pubs.broadinstitute.org/diabimmune), substantiating significant early lifetime intervals of exposure to Bacteroides and Bifidobacterium in Finnish and Russian infants. Additionally, we established significant time intervals during which novel differentially abundant microbial genera may contribute to aberrant immunogenicity and development of autoimmune disease.

Conclusion: MetaLonDA is computationally efficient and can be run on desktop machines. The identified differentially abundant features and their time intervals have the potential to distinguish microbial biomarkers that may be used for microbial reconstitution through bacteriotherapy, probiotics, or antibiotics. Moreover, MetaLonDA can be applied to any longitudinal count data, such as metagenomic sequencing, 16S rRNA gene sequencing, or RNA-seq. MetaLonDA is publicly available on CRAN (https://CRAN.R-project.org/package=MetaLonDA).
