Clustering metagenomic sequences with interpolated Markov models

Background: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output of metagenomic sequencing is a large set of reads of unknown origin, clustering together the reads that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of the environments of interest to many metagenomics projects.

Results: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets and suggest PHYSCIMM, a hybrid of SCIMM and the supervised learning method Phymm, which performs better when evolutionarily close training genomes are available.

Conclusions: SCIMM and PHYSCIMM are highly accurate methods for clustering metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available as open source from http://www.cbcb.umd.edu/software/scimm.
