MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information

Background: A central challenge to understanding the ecological and biogeochemical roles of microorganisms in natural and human engineered ecosystems is the reconstruction of metabolic interaction networks from environmental sequence information. The dominant paradigm in metabolic reconstruction is to assign functional annotations using BLAST. Functional annotations are then projected onto symbolic representations of metabolism in the form of KEGG pathways or SEED subsystems.

Results: Here we present MetaPathways, an open source pipeline for pathway inference that uses the PathoLogic algorithm to map functional annotations onto the MetaCyc collection of reactions and pathways, and construct environmental Pathway/Genome Databases (ePGDBs) compatible with the editing and navigation features of Pathway Tools. The pipeline accepts assembled or unassembled nucleotide sequences, performs quality assessment and control, predicts and annotates noncoding genes and open reading frames, and produces inputs to PathoLogic. In addition to constructing ePGDBs, MetaPathways uses MLTreeMap to build phylogenetic trees for selected taxonomic anchor and functional gene markers, converts General Feature Format (GFF) files into concatenated GenBank files for ePGDB construction based on third-party annotations, and generates useful file formats including Sequin files for direct GenBank submission and gene feature tables summarizing annotations, MLTreeMap trees, and ePGDB pathway coverage summaries for statistical comparisons.

Conclusions: MetaPathways provides users with a modular annotation and analysis pipeline for predicting metabolic interaction networks from environmental sequence information using an alternative to KEGG pathways and SEED subsystems mapping. It is extensible to genomic and transcriptomic datasets from a wide range of sequencing platforms, and generates useful data products for microbial community structure and function analysis.
The MetaPathways software package, installation instructions, and example data can be obtained from http://hallam.microbiology.ubc.ca/MetaPathways.
