MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures

MOTIVATION: Sequencing technologies allow microbial communities to be sequenced directly from the environment, without prior culturing. Taxonomic analysis of microbial communities, a process referred to as binning, is one of the most challenging tasks in the analysis of metagenomic read data. The major problems are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratios of species, and the limitations imposed by short read lengths and sequencing errors.

RESULTS: MetaProb is a novel assembly-assisted tool for unsupervised metagenomic binning. Its novelty derives from solving three problems: how to divide reads into groups of independent reads, so that k-mer frequencies are not overestimated; how to convert k-mer counts into probabilistic sequence signatures that correct for the variable distribution of k-mers and for unbalanced groups of reads, in order to produce better estimates of the underlying genome statistics; and how to estimate the number of species in a dataset. We show that MetaProb is more accurate and efficient than other state-of-the-art tools in binning both short-read datasets (F-measure 0.87) and long-read datasets (F-measure 0.97) for various abundance ratios. Its estimate of the number of species is also more accurate than that of MetaCluster. On a real human stool dataset, MetaProb identifies the most predominant species, in line with previous human gut studies.

AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/samu661/metaprob

CONTACTS: cinzia.pizzi@dei.unipd.it or comin@dei.unipd.it

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
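To make the idea of converting k-mer counts into a probabilistic signature concrete, the following is a minimal Python sketch. It is not MetaProb's implementation: the abstract does not specify the statistical model, so the sketch assumes a simple independent-nucleotide background and standardizes each k-mer count as a z-score, (observed - expected) / standard deviation, under a binomial approximation. The function names `kmer_counts` and `standardized_signature` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"

def kmer_counts(reads, k):
    """Count every k-mer made only of A/C/G/T across a group of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if all(ch in ALPHABET for ch in kmer):
                counts[kmer] += 1
    return counts

def standardized_signature(reads, k):
    """Return a dict k-mer -> z-score under an ASSUMED i.i.d. nucleotide model.

    This is an illustrative standardization, not the model used by MetaProb.
    """
    counts = kmer_counts(reads, k)
    total = sum(counts.values())

    # Empirical nucleotide frequencies estimated from the group itself.
    bases = Counter(ch for read in reads for ch in read if ch in ALPHABET)
    n_bases = sum(bases.values())

    signature = {}
    for kmer in map("".join, product(ALPHABET, repeat=k)):
        if n_bases == 0 or total == 0:
            signature[kmer] = 0.0
            continue
        # Probability of the k-mer if nucleotides were independent.
        p = math.prod(bases[b] / n_bases for b in kmer)
        expected = total * p
        variance = total * p * (1.0 - p)  # binomial approximation
        if variance > 0:
            signature[kmer] = (counts[kmer] - expected) / math.sqrt(variance)
        else:
            signature[kmer] = 0.0
    return signature

if __name__ == "__main__":
    group = ["ACGTACGT", "TTGACA"]
    sig = standardized_signature(group, k=2)
    # Show the five most over-represented 2-mers in this toy group.
    for kmer, z in sorted(sig.items(), key=lambda kv: -kv[1])[:5]:
        print(kmer, round(z, 3))
```

Standardizing counts in this way makes signatures from groups of different sizes roughly comparable, which is the motivation the abstract gives for moving from raw counts to probabilistic signatures; the exact correction used by the tool may differ.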
