MarkovBin: An Algorithm to Cluster Metagenomic Reads Using a Mixture Modeling of Hierarchical Distributions

Metagenomics is the study of genomic content of microorganisms from environmental samples without isolation and cultivation. Recently developed next generation sequencing (NGS) technologies efficiently generate vast amounts of metagenomic DNA sequences. However, the ultra-high throughput and short read lengths make the separation of reads from different species more challenging. Among the existing computational tools for NGS data, there are supervised methods that use reference databases to classify reads and unsupervised methods that use oligonucleotide patterns to cluster reads. The former may leave a large fraction of reads unclassified due to the absence of closely related references. The latter often rely on long oligonucleotide frequencies and are sensitive to species abundance levels. In this work, we present MarkovBin, a new unsupervised method that can accurately cluster metagenomic reads across various species abundance ratios. We first model the nucleotide sequences as a fixed-order Markov chain. We then propose a hierarchical distribution to model the dependency between paired-end reads. Finally, we employ the mixture model framework to separate reads from different genomes in a metagenomic dataset. Using extensive simulation data, we demonstrate a high accuracy and precision by comparing to selected unsupervised read clustering tools. The software is freely available at http://orleans.cs.wayne.edu/MarkovBin.

[1]  T. Urich,et al.  Archaea predominate among ammonia-oxidizing prokaryotes in soils , 2006, Nature.

[2]  Frank Oliver Glöckner,et al.  Current opportunities and challenges in microbial metagenome analysis—a bioinformatic perspective , 2012, Briefings Bioinform..

[3]  Jonathan A Eisen,et al.  Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes , 2007, PLoS biology.

[4]  M. Waterman,et al.  Estimating the repeat structure and length of DNA sequences using L-tuples. , 2003, Genome research.

[5]  Siu-Ming Yiu,et al.  MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample , 2012, Bioinform..

[6]  Frank Oliver Glöckner,et al.  TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences , 2004, BMC Bioinformatics.

[7]  Sophie Schbath,et al.  Exceptional Motifs in Different Markov Chain Models for a Statistical Analysis of DNA Sequences , 1995, J. Comput. Biol..

[8]  Lior Pachter,et al.  Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities , 2005, PLoS Comput. Biol..

[9]  J. Handelsman,et al.  Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. , 1998, Chemistry & biology.

[10]  C. Deming,et al.  Topographical and Temporal Diversity of the Human Skin Microbiome , 2009, Science.

[11]  G J McLachlan,et al.  Mixture modelling for cluster analysis , 2004, Statistical methods in medical research.

[12]  Siu-Ming Yiu,et al.  MetaCluster 4.0: A Novel Binning Algorithm for NGS Reads and Huge Number of Species , 2012, J. Comput. Biol..

[13]  Zhaojun Bai,et al.  CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads , 2007, RECOMB.

[14]  N. Pace,et al.  Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. , 1985, Proceedings of the National Academy of Sciences of the United States of America.

[15]  G. McLachlan,et al.  The EM algorithm and extensions , 1996 .

[16]  Gabriel Valiente,et al.  Flexible taxonomic assignment of ambiguous sequencing reads , 2011, BMC Bioinformatics.

[17]  Mihai Pop,et al.  Genome assembly reborn: recent computational challenges , 2009, Briefings Bioinform..

[18]  Yan Boucher,et al.  Use of 16S rRNA and rpoB Genes as Molecular Markers for Microbial Ecology Studies , 2006, Applied and Environmental Microbiology.

[19]  D. Alland,et al.  A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. , 2007, Journal of microbiological methods.

[20]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[21]  Alexander F. Auch,et al.  MEGAN analysis of metagenomic data. , 2007, Genome research.

[22]  G. McLachlan,et al.  Extensions of the EM Algorithm , 2007 .

[23]  Jonathan Dushoff,et al.  Unsupervised statistical clustering of environmental shotgun sequences , 2009, BMC Bioinformatics.

[24]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[25]  Benjamin J. Raphael,et al.  The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families , 2007, PLoS biology.

[26]  James R. Knight,et al.  Genome sequencing in microfabricated high-density picolitre reactors , 2005, Nature.

[27]  R. Amann,et al.  Application of tetranucleotide frequencies for the assignment of genomic fragments. , 2004, Environmental microbiology.

[28]  John C. Wooley,et al.  A Primer on Metagenomics , 2010, PLoS Comput. Biol..

[29]  P. Bork,et al.  A human gut microbial gene catalogue established by metagenomic sequencing , 2010, Nature.

[30]  Daniel H. Huson,et al.  MetaSim—A Sequencing Simulator for Genomics and Metagenomics , 2008, PloS one.

[31]  H. Akaike A new look at the statistical model identification , 1974 .

[32]  R. Knight,et al.  Bacterial Community Variation in Human Body Habitats Across Space and Time , 2009, Science.

[33]  D. Bentley,et al.  Whole-genome re-sequencing. , 2006, Current opinion in genetics & development.

[34]  David Ussery,et al.  Investigations of Oligonucleotide Usage Variance Within and Between Prokaryotes , 2008, PLoS Comput. Biol..

[35]  S. Giovannoni,et al.  The uncultured microbial majority. , 2003, Annual review of microbiology.

[36]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[37]  Siu-Ming Yiu,et al.  A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio , 2011, Bioinform..

[38]  Shigehiko Kanaya,et al.  Informatics for unveiling hidden genome signatures. , 2003, Genome research.

[39]  J. Banfield,et al.  Community structure and metabolism through reconstruction of microbial genomes from the environment , 2004, Nature.

[40]  Naryttza N. Diaz,et al.  TACOA – Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach , 2009, BMC Bioinformatics.

[41]  James R. Cole,et al.  The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis , 2004, Nucleic Acids Res..

[42]  BMC Bioinformatics , 2005 .

[43]  Steven Salzberg,et al.  Clustering metagenomic sequences with interpolated Markov models , 2010, BMC Bioinformatics.

[44]  S. Salzberg,et al.  Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models , 2009, Nature Methods.

[45]  M. Cugmas,et al.  On comparing partitions , 2015 .

[46]  Yu-Wei Wu,et al.  A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples , 2010, RECOMB.

[47]  Lawrence G. Wayne,et al.  International Committee on Systematic Bacteriology announcement of the report of the ad hoc Committee on Reconciliation of Approaches to Bacterial Systematics. , 1988, The Journal of applied bacteriology.

[48]  J. Eisen,et al.  A simple, fast, and accurate method of phylogenomic inference , 2008, Genome Biology.

[49]  Huzefa Rangwala,et al.  Evaluation of short read metagenomic assembly , 2010, 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[50]  I. Rigoutsos,et al.  Accurate phylogenetic classification of variable-length DNA fragments , 2007, Nature Methods.