SWARM: a scientific workflow for supporting Bayesian approaches to improve metabolic models

With the exponential growth in the number of complete genome sequences, sequence analysis is becoming a powerful approach to building genome-scale metabolic models. These models can be used to study individual molecular components and their relationships, and eventually to study cells as systems. However, constructing genome-scale metabolic models manually is time-consuming and labor-intensive. As a result, far fewer genome-scale metabolic models are available than the hundreds of genome sequences that have been published. To tackle this problem, we design SWARM, a scientific workflow that can be used to improve genome-scale metabolic models in a high-throughput fashion. SWARM deals with a range of issues, including the integration of data across distributed resources, data format conversion, data updates, and data provenance. Taken together, SWARM streamlines the whole modeling process: extracting data from various resources, deriving training datasets to train a set of predictors, applying Bayesian techniques to combine the predictors into an ensemble, running inference on the ensemble to fill in missing data, and finally improving draft metabolic networks automatically. By enhancing metabolic model construction, SWARM enables scientists to generate many genome-scale metabolic models within a short period of time and with less effort.
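
The Bayesian step in this pipeline can be illustrated with a minimal sketch. It is not SWARM's implementation: it assumes a naive-Bayes style combination in which each predictor contributes an independent likelihood ratio for the hypothesis "this gene encodes the enzyme for this reaction". The function names `posterior` and `fill_gaps`, the input layout, the prior, and the threshold are all hypothetical choices made for illustration.

```python
from math import prod


def posterior(prior, likelihood_ratios):
    """Combine evidence with Bayes' rule in odds form.

    Assumption (not from the paper): predictors are conditionally
    independent, so their likelihood ratios simply multiply.
    """
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * prod(likelihood_ratios)
    return post_odds / (1.0 + post_odds)


def fill_gaps(candidates, prior, threshold=0.8):
    """Pick the most probable gene for each missing reaction.

    `candidates` maps a reaction id to {gene id: [likelihood ratios]},
    one ratio per predictor. A reaction is filled only if its best
    candidate reaches the (hypothetical) posterior threshold.
    """
    filled = {}
    for reaction, genes in candidates.items():
        best_gene, best_p = None, 0.0
        for gene, ratios in genes.items():
            p = posterior(prior, ratios)
            if p > best_p:
                best_gene, best_p = gene, p
        if best_gene is not None and best_p >= threshold:
            filled[reaction] = (best_gene, best_p)
    return filled


# Made-up example data: two missing reactions in a draft network.
candidates = {
    "R1": {"geneA": [9.0, 9.0], "geneB": [0.5]},
    "R2": {"geneC": [1.0]},
}
filled = fill_gaps(candidates, prior=0.1)
```

In this toy input, only reaction `R1` has a candidate with enough supporting evidence to be filled; `R2` stays as a gap because its single predictor is uninformative (likelihood ratio 1.0), so the posterior equals the prior.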

[38]  Yukiko Matsuoka,et al.  Celldesigner: A Modeling Tool for Biochemical Networks , 2006, Proceedings of the 2006 Winter Simulation Conference.