Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models

Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. A main challenge in metagenomic analysis is the phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics both easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.
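To make the idea of composition-based classification concrete, the following is a minimal sketch and not Phymm's implementation. It trains one Markov-style model per reference genome, scores a read under each model, and assigns the read to the best-scoring genome. The function names (`train`, `log_score`, `classify`), the maximum order of 3, the add-one smoothing, and the uniform averaging across orders are all assumptions made for brevity; interpolated Markov models as used in GLIMMER weight each context order according to how well the training data supports it, which this sketch does not attempt.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train(genome, max_order=3):
    """Count next-base occurrences after every context of length 0..max_order."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for i, base in enumerate(genome):
        for k in range(max_order + 1):
            if i >= k:
                counts[k][genome[i - k:i]][base] += 1
    return counts


def log_score(read, counts):
    """Log-likelihood of a read under one genome's model.

    Each base's probability is the plain average of add-one-smoothed
    estimates from all usable context orders (a simplification; real
    IMMs use data-dependent weights).
    """
    max_order = len(counts) - 1
    total = 0.0
    for i, base in enumerate(read):
        orders = range(min(i, max_order) + 1)
        p = 0.0
        for k in orders:
            ctx = counts[k].get(read[i - k:i], {})
            p += (ctx.get(base, 0) + 1) / (sum(ctx.values()) + len(ALPHABET))
        total += math.log(p / len(orders))
    return total


def classify(read, models):
    """Return the name of the model that gives the read the highest score."""
    return max(models, key=lambda name: log_score(read, models[name]))


# Toy "genomes" (hypothetical data, chosen only to show the mechanics).
models = {
    "at_rich": train("ATATATATTATAATATATAT" * 5),
    "gc_rich": train("GCGCGGCGCCGCGCGGCGCG" * 5),
}
print(classify("ATATTATA", models))
```

In a realistic setting the models would be trained on full reference genomes with much longer contexts, and the per-genome scores could then be combined with alignment evidence, as the abstract describes for the hybrid approach.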
