A two-way multi-dimensional mixture model for clustering metagenomic sequences

Motivation: A major challenge facing metagenomics is the development of tools for characterizing the functional and taxonomic content of vast amounts of short metagenome reads. The efficacy of clustering methods depends on the number of reads in the dataset, the read length, and the relative abundances of the source genomes in the microbial community.

Results: In this paper, we formulate an unsupervised naive Bayes multi-species, multi-dimensional mixture model for reads from a metagenome. We use the proposed model to cluster metagenomic reads by their species of origin and to characterize the abundance of each species. We model the distribution of word counts along a genome as a Gaussian for shorter, frequent words and as a Poisson for longer, rare words, and we employ either a mixture of Gaussians or a mixture of Poissons to model the reads within each bin. An additional reason to use these distributions is their flexibility and ease of parameter estimation. Such a paradigm characterizes the compositional heterogeneity of the words along a genome, which signifies its genome signature. Further, we handle the high dimensionality and sparsity of the data by grouping the set of words comprising the reads, which results in a two-way mixture model. Finally, we derive an unsupervised Expectation Maximization algorithm for the models. The method provides a general statistical framework for modeling metagenome reads. We demonstrate its accuracy and applicability on simulated and real metagenomes: it can accurately cluster reads as short as 100 bp and estimate species abundance as well. It outperforms LikelyBin, another unsupervised composition-based binning method for metagenomes, on datasets of varying abundances, divergences and read lengths.
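The Poisson variant of the model described above can be illustrated with a minimal sketch. It assumes each read is reduced to a vector of overlapping word (k-mer) counts, that counts of different words are independent given the species (the naive Bayes assumption), and that each species has its own Poisson rate per word. It covers only the one-way mixture: the grouping of words that gives the two-way model, and the Gaussian variant for short words, are left out. The function names `kmer_counts`, `poisson_mixture_em` and `hard_labels`, the random soft initialization and the small smoothing constant are illustrative choices, not the authors' implementation.

```python
import math
import random
from itertools import product


def kmer_counts(read, k=2):
    """Count overlapping k-mers of a read, in lexicographic order over ACGT."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])  # words with other symbols (e.g. N) are skipped
        if j is not None:
            counts[j] += 1
    return counts


def poisson_mixture_em(X, n_clusters, n_iter=50, seed=0):
    """EM for a naive Bayes mixture of Poissons over word-count vectors X.

    Returns (weights, rates, resp): mixing weights (abundance estimates),
    per-cluster Poisson rates per word, and per-read responsibilities.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    eps = 1e-6  # smoothing so that every rate stays strictly positive (assumption)

    # Random soft initialization of the responsibilities.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: mixing weights and responsibility-weighted mean counts.
        nk = [sum(resp[i][c] for i in range(n)) for c in range(n_clusters)]
        weights = [v / n for v in nk]
        rates = [
            [(sum(resp[i][c] * X[i][j] for i in range(n)) + eps) / (nk[c] + eps)
             for j in range(d)]
            for c in range(n_clusters)
        ]

        # E-step: posterior of each cluster given the read, computed in log space.
        # The log(x!) term is the same for every cluster, so it is dropped.
        for i in range(n):
            logp = []
            for c in range(n_clusters):
                lp = math.log(weights[c] + 1e-300)
                for j in range(d):
                    lp += X[i][j] * math.log(rates[c][j]) - rates[c][j]
                logp.append(lp)
            m = max(logp)
            ex = [math.exp(v - m) for v in logp]
            total = sum(ex)
            resp[i] = [e / total for e in ex]

    return weights, rates, resp


def hard_labels(resp):
    """Assign each read to the cluster with the highest responsibility."""
    return [max(range(len(row)), key=row.__getitem__) for row in resp]
```

In use, one would build `X = [kmer_counts(r, k) for r in reads]`, call `poisson_mixture_em(X, n_clusters)`, read the mixing weights as rough estimates of the fraction of reads per cluster, and pass the responsibilities to `hard_labels` to bin the reads. The number of clusters is taken as given here; choosing it is a separate problem that this sketch does not address.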
