Phylogeny-aware identification and correction of taxonomically mislabeled sequences

Abstract

Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labor-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences (‘mislabels’) using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity/91.7% precision) as well as correction (94.9% sensitivity/89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria. SATIVA is freely available at https://github.com/amkozlov/sativa.
