MetaFlow: Metagenomic profiling based on whole-genome coverage analysis with min-cost flows

High-throughput sequencing (HTS) of metagenomes is proving essential in understanding the environment and diseases. State-of-the-art methods for discovering the species and their abundances in an HTS metagenomic sample are based on genome-specific markers, which can lead to skewed results, especially at the species level. We present MetaFlow, the first method based on coverage analysis across entire genomes that also scales to HTS samples. We formulate this problem as an NP-hard matching problem in a bipartite graph, which we solve in practice with min-cost flows. On synthetic data sets of varying complexity and similarity, MetaFlow is more precise and sensitive than popular tools such as MetaPhlAn, mOTU, GSMer and BLAST, and its abundance estimations at the species level are two to four times better in terms of the ℓ1-norm. On a real human stool data set, MetaFlow identifies B. uniformis as the predominant species, in line with previous human gut studies, whereas marker-based methods report it as rare. MetaFlow is freely available at http://cs.helsinki.fi/gsa/metaflow
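To make the idea of solving a bipartite read-to-genome matching with min-cost flows more concrete, here is a minimal sketch in Python. It is not MetaFlow's actual formulation, which the abstract describes as NP-hard and based on coverage across whole genomes. It only illustrates the textbook reduction of a capacitated bipartite assignment to a min-cost flow, solved with a simple successive-shortest-path routine. The names `MinCostFlow`, `assign_reads`, the read and chunk labels, the costs, and the capacities are all hypothetical and chosen for illustration.

```python
from collections import deque


class MinCostFlow:
    """Successive shortest paths on a residual graph (illustrative, not optimized)."""

    def __init__(self, n):
        self.n = n
        # Each edge is [target, residual_capacity, cost, index_of_reverse_edge].
        self.graph = [[] for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        flow, cost = 0, 0
        while True:
            # Queue-based Bellman-Ford, which tolerates negative residual costs.
            dist = [float("inf")] * self.n
            prev = [None] * self.n
            in_queue = [False] * self.n
            dist[s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                in_queue[u] = False
                for i, (v, cap, w, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + w < dist[v]:
                        dist[v] = dist[u] + w
                        prev[v] = (u, i)
                        if not in_queue[v]:
                            in_queue[v] = True
                            queue.append(v)
            if prev[t] is None:  # no augmenting path left
                return flow, cost
            # Find the bottleneck capacity along the shortest path.
            push, v = float("inf"), t
            while v != s:
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            # Augment along the path and update the reverse edges.
            v = t
            while v != s:
                u, i = prev[v]
                edge = self.graph[u][i]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            flow += push
            cost += push * dist[t]


def assign_reads(candidates, capacity):
    """Assign each read to one candidate genome chunk at minimum total cost.

    candidates: {read: {chunk: mapping_cost}}  (hypothetical input format)
    capacity:   {chunk: max number of reads the chunk may receive}
    """
    reads, chunks = sorted(candidates), sorted(capacity)
    source, sink = 0, 1
    read_id = {r: 2 + i for i, r in enumerate(reads)}
    chunk_id = {c: 2 + len(reads) + j for j, c in enumerate(chunks)}
    chunk_of = {node: c for c, node in chunk_id.items()}

    mcf = MinCostFlow(2 + len(reads) + len(chunks))
    for r in reads:
        mcf.add_edge(source, read_id[r], 1, 0)
        for c, w in candidates[r].items():
            mcf.add_edge(read_id[r], chunk_id[c], 1, w)
    for c in chunks:
        mcf.add_edge(chunk_id[c], sink, capacity[c], 0)

    _, total_cost = mcf.solve(source, sink)

    # A saturated read -> chunk edge means the read was assigned to that chunk.
    assignment = {}
    for r in reads:
        for v, cap, _, _ in mcf.graph[read_id[r]]:
            if v in chunk_of and cap == 0:
                assignment[r] = chunk_of[v]
    return assignment, total_cost


if __name__ == "__main__":
    # Toy data: three reads, two chunks from two hypothetical genomes A and B.
    toy_candidates = {
        "r1": {"A1": 0, "B1": 1},
        "r2": {"A1": 0, "B1": 0},
        "r3": {"A1": 1, "B1": 3},
    }
    toy_capacity = {"A1": 2, "B1": 1}
    print(assign_reads(toy_candidates, toy_capacity))
```

In this toy setup, the chunk capacities stand in for the expectation that reads should cover a genome roughly evenly, so an ambiguous read such as `r2` is routed to wherever the overall assignment is cheapest rather than to its first hit. A production solver would more likely rely on an optimized graph library than on this plain routine.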
