ComPath: comparative enzyme analysis and annotation in pathway/subsystem contexts

Background: Once a new genome is sequenced, one of the important questions is which biological pathways are present and which are absent. Analyzing biological pathways in a genome is complicated because many biological entities are involved in each pathway and because pathways are not identical across organisms. Computational pathway identification and analysis therefore involves a number of tools and databases and is typically done in comparison with pathways in other organisms. These computational requirements are well beyond what most biologists can handle on their own, so information systems for reconstructing, annotating, and analyzing biological pathways are much needed. We introduce a new comparative pathway analysis workbench, ComPath, which integrates various resources and computational tools behind an interactive spreadsheet-style web interface for reliable pathway analyses.

Results: ComPath allows users to compare biological pathways in multiple genomes using a spreadsheet-style web interface, in which various sequence-based analyses can be performed to compare enzymes (e.g. sequence clustering) and pathways (e.g. pathway hole identification), to search a genome for de novo prediction of enzymes, or to annotate a genome in comparison with reference genomes of choice. To fill in pathway holes or make de novo enzyme predictions, ComPath integrates multiple computational methods: FASTA, Whole-HMM, CSR-HMM (a method of our own, introduced in this paper), and PDB-domain search. Our experiments show that the FASTA and CSR-HMM search methods generally outperform the Whole-HMM and PDB-domain search methods in terms of sensitivity, but FASTA search performs poorly in terms of specificity, detecting more false positives as the E-value cutoff increases. Overall, the CSR-HMM search method performs best in terms of both sensitivity and specificity. Gene neighborhood and pathway neighborhood (global network) visualization tools can be used to obtain context information that is complementary to the conventional KEGG map representation.

Conclusion: ComPath is an interactive workbench for pathway reconstruction, annotation, and analysis in which experts can perform various sequence, domain, and context analyses using an intuitive and interactive spreadsheet-style interface.
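As an illustration of the pathway hole identification mentioned in the Results, the following is a minimal sketch rather than ComPath's actual implementation. It assumes a pathway can be represented as a set of EC numbers and each genome as the set of EC numbers assigned to its genes; a pathway hole is then any EC number in the pathway with no annotated enzyme in a given genome. The function name, genome labels, and data below are hypothetical.

```python
from typing import Dict, Set


def find_pathway_holes(
    pathway_ecs: Set[str],
    genome_annotations: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, for each genome, the pathway EC numbers with no annotated enzyme.

    This is a plain set difference; it does not attempt to fill the holes
    (the abstract attributes that step to FASTA, HMM, and PDB-domain searches).
    """
    return {
        genome: pathway_ecs - annotated_ecs
        for genome, annotated_ecs in genome_annotations.items()
    }


# Hypothetical example: four EC numbers from upper glycolysis.
pathway = {"2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"}

# Hypothetical annotations for two genomes (one spreadsheet column each).
annotations = {
    "genome_A": {"2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"},
    "genome_B": {"2.7.1.1", "5.3.1.9", "4.1.2.13", "1.1.1.1"},
}

holes = find_pathway_holes(pathway, annotations)
for genome in sorted(holes):
    print(genome, sorted(holes[genome]))
```

In this toy input, genome_A has no holes and genome_B lacks an enzyme for EC 2.7.1.11, which would make that reaction a candidate for a follow-up sequence search.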
