Systematic Identification of Gene Families for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological Studies of Bacteria and Archaea and Their Major Subgroups

Given the astonishing rate at which genomic and metagenomic sequence data sets are accumulating, there are many reasons to constrain the data analyses. One approach to such constrained analyses is to focus on select subsets of gene families that are particularly well suited for the tasks at hand. Such gene families have generally been referred to as “marker” genes. We are particularly interested in identifying and using such marker genes for phylogenetic and phylogeny-driven ecological studies of microbes and their communities (e.g., construction of species trees, phylogeny-based assignment of metagenomic sequence reads to taxonomic groups, and phylogeny-based assessment of alpha- and beta-diversity of microbial communities from metagenomic data). We therefore refer to these as PhyEco (for phylogenetic and phylogenetic ecology) markers. Because PhyEco markers serve this dual use, we needed to develop and apply a set of somewhat novel criteria for identifying the best candidates. The criteria we focused on were universality across the taxa of interest, the ability to produce robust phylogenetic trees that reflect as closely as possible the evolution of the species from which the genes come, and low variation in copy number across taxa. We describe here an automated protocol for identifying potential PhyEco markers from a set of complete genome sequences. The protocol combines rapid searching, clustering, and phylogenetic tree-building algorithms to generate protein families that meet the criteria listed above. We report here the identification of PhyEco markers for different taxonomic levels, including 40 for “all bacteria and archaea”, 114 for “all bacteria” (greatly expanding on the ∼30 commonly used), and 100s to 1000s for some of the individual phyla of bacteria. This new list of PhyEco markers should allow much more detailed automated phylogenetic and phylogenetic ecology analyses of these groups than was previously possible.
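Two of the criteria named above, universality and low variation in copy number, can be illustrated with a minimal Python sketch. It is not the protocol described here: the function name `screen_candidate_markers`, the input layout (a per-family table of gene copies per genome), and the 0.9 thresholds are all assumptions chosen for illustration, and the tree-based criterion is left out entirely.

```python
from typing import Dict, List


def screen_candidate_markers(
    copy_counts: Dict[str, Dict[str, int]],
    genomes: List[str],
    min_universality: float = 0.9,  # hypothetical cutoff, not from the paper
    min_single_copy: float = 0.9,   # hypothetical cutoff, not from the paper
) -> List[str]:
    """Return IDs of families that pass both screens, sorted by name.

    copy_counts maps family ID -> {genome ID: number of copies}.
    A genome missing from a family's table counts as zero copies.
    """
    if not genomes:
        raise ValueError("genomes must not be empty")

    passing = []
    for family, counts in copy_counts.items():
        # Universality: fraction of genomes carrying at least one copy.
        present = sum(1 for g in genomes if counts.get(g, 0) >= 1)
        # Copy-number screen: fraction of genomes with exactly one copy.
        single = sum(1 for g in genomes if counts.get(g, 0) == 1)

        universality = present / len(genomes)
        single_copy = single / len(genomes)
        if universality >= min_universality and single_copy >= min_single_copy:
            passing.append(family)
    return sorted(passing)


# Toy data: four genomes, three hypothetical families.
genomes = ["g1", "g2", "g3", "g4"]
toy_counts = {
    "famA": {"g1": 1, "g2": 1, "g3": 1, "g4": 1},  # universal, single copy
    "famB": {"g1": 1, "g2": 3, "g3": 2, "g4": 1},  # universal, variable copies
    "famC": {"g1": 1, "g2": 1},                    # present in half the genomes
}
candidates = screen_candidate_markers(toy_counts, genomes)
print(candidates)
```

With the default cutoffs only `famA` survives: `famB` fails the copy-number screen and `famC` fails universality. A family passing this kind of screen would still need to be checked against the phylogenetic criterion before being treated as a marker.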
