Reference-free comparison of microbial communities via de Bruijn graphs

Microbial communities inhabiting the human body vary significantly across individuals and tissues, and are thought to play an important role in health and disease. High-throughput sequencing offers unprecedented possibilities to profile microbial community composition, but limitations of existing taxonomic classification methods (including the incompleteness of existing microbial reference databases) limit the ability to accurately compare microbial communities across samples. In this paper, we present a method that overcomes these limitations by circumventing the classification step and using the sequencing data directly to compare microbial communities. The proposed method provides a powerful reference-free way to assess differences in microbial abundances across samples. This method, called EMDeBruijn, condenses the sequencing data of each sample into a de Bruijn graph. The Earth Mover's Distance (EMD) is then used to measure similarities and differences between the microbial communities associated with the individual graphs. We apply this method to RNA-Seq data sets from a coronary artery calcification (CAC) study and show that EMDeBruijn is able to differentiate between case and control CAC samples while utilizing all the candidate microbial reads. We compare these results to those of current reference-based methods, which are shown to have a limited capacity to discriminate between case and control samples. We conclude that this reference-free approach is a viable choice in comparative metatranscriptomic studies.
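To make the idea concrete, the following is a minimal sketch of an EMD between two read sets over a de Bruijn graph; it is not the published EMDeBruijn implementation. Several choices are assumptions made for illustration only: the graph is the undirected de Bruijn graph on all 4^k k-mers over the alphabet ACGT, the ground distance is the shortest-path length in that graph, reads contain only A, C, G and T, and the EMD is solved exactly with a small successive-shortest-path min-cost flow, which is practical only for small k. The function names (`kmer_counts`, `debruijn_distances`, `min_cost_transport`, `emd_debruijn`) are hypothetical.

```python
from collections import Counter, deque
from itertools import product
from math import inf

def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def debruijn_distances(k, alphabet="ACGT"):
    """All-pairs shortest-path lengths in the undirected de Bruijn graph
    on all k-mers (assumed ground distance; feasible for small k only)."""
    nodes = ["".join(p) for p in product(alphabet, repeat=k)]
    adj = {u: set() for u in nodes}
    for u in nodes:
        for c in alphabet:
            v = u[1:] + c          # de Bruijn edge: drop first base, append one
            adj[u].add(v)
            adj[v].add(u)          # symmetrize the edge
    dist = {}
    for s in nodes:                # breadth-first search from every node
        d, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[s] = d
    return dist

def min_cost_transport(supply, demand, cost):
    """Exact minimum cost of moving integer supplies onto integer demands
    (equal totals), via successive shortest paths with Bellman-Ford."""
    S, D = list(supply), list(demand)
    n = len(S) + len(D) + 2
    src, snk = n - 2, n - 1
    graph = [[] for _ in range(n)]  # each edge: [to, capacity, cost, reverse index]

    def add_edge(u, v, cap, c):
        graph[u].append([v, cap, c, len(graph[v])])
        graph[v].append([u, 0, -c, len(graph[u]) - 1])

    for i, u in enumerate(S):
        add_edge(src, i, supply[u], 0)
        for j, v in enumerate(D):
            add_edge(i, len(S) + j, supply[u], cost(u, v))
    for j, v in enumerate(D):
        add_edge(len(S) + j, snk, demand[v], 0)

    total = 0
    while True:
        dist, prev = [inf] * n, [None] * n
        dist[src] = 0
        for _ in range(n - 1):      # Bellman-Ford on the residual graph
            changed = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for idx, (v, cap, c, _rev) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v], prev[v], changed = dist[u] + c, (u, idx), True
            if not changed:
                break
        if dist[snk] == inf:        # no augmenting path left: flow is maximal
            return total
        flow, v = inf, snk
        while v != src:             # bottleneck capacity along the path
            u, idx = prev[v]
            flow, v = min(flow, graph[u][idx][1]), u
        v = snk
        while v != src:             # push the flow along the path
            u, idx = prev[v]
            edge = graph[u][idx]
            edge[1] -= flow
            graph[v][edge[3]][1] += flow
            v = u
        total += flow * dist[snk]

def emd_debruijn(reads_a, reads_b, k):
    """EMD between the normalized k-mer distributions of two read sets."""
    ca, cb = kmer_counts(reads_a, k), kmer_counts(reads_b, k)
    ta, tb = sum(ca.values()), sum(cb.values())
    dist = debruijn_distances(k)
    # Cross-multiply by the other sample's total so both masses are integers
    # with the same sum ta * tb; dividing by that sum normalizes the result.
    supply = {u: c * tb for u, c in ca.items()}
    demand = {v: c * ta for v, c in cb.items()}
    return min_cost_transport(supply, demand, lambda u, v: dist[u][v]) / (ta * tb)
```

For example, `emd_debruijn(["AAAA"], ["AAAT"], 2)` evaluates to about 0.333: one third of the k-mer mass of the first sample has to move from `AA` to the adjacent node `AT`. Two identical read sets give a distance of 0. Because the distance depends only on the reads and the graph, no reference database is consulted at any point.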
