MeGaFiller: A Retrofitted Protein Function Predictor for Filling Gaps in Metabolic Networks

Background: A bottleneck in investigating the cellular metabolism and physiology of organisms is the presence of metabolic gaps in genome-scale metabolic networks. Metabolic gaps are reactions in the network for which the corresponding genes have not yet been identified. Previous gap-filling methods generally identify a protein family in related organisms and then use that family to help find the target gene in a given genome. However, these methods fail when the protein family is not well defined, so many gaps remain in current metabolic networks. Here, we attempt to fill these gaps through an indirect approach: retrofitting protein function predictors and post-processing their results to identify candidate genes.

Results: We developed a novel method for metabolic gap filling, called MeGaFiller, that uses an ensemble of multiple retrofitted state-of-the-art protein function predictors. The ensemble scheme was adopted to boost prediction performance. MeGaFiller can propose candidate genes for 35% of the metabolic gaps in different metabolic networks (yeast, three filamentous fungi, and a bacterium). It can also predict up to hundreds of novel candidate genes for previously annotated functions in the metabolic networks, and it can provide novel candidate genes for novel putative reactions throughout those networks.

Conclusions: The MeGaFiller method represents our first effort at filling metabolic gaps in metabolic networks using retrofitted protein function predictors. It serves as a bioinformatics tool that supports improved annotation through genome-scale metabolic network reconstruction.
