Gene function prediction based on genomic context clustering and discriminative learning: an application to bacteriophages

Background: Existing methods for whole-genome comparisons require prior knowledge of related species and provide little automation in the function prediction process. Bacteriophage genomes are an example that cannot be easily analyzed by these methods. This work addresses these shortcomings and aims to provide an automated prediction system of gene function.

Results: We have developed a novel system called SynFPS to perform gene function prediction over completed genomes. The prediction system is initialized by clustering a large collection of weakly related genomes into groups based on their resemblance in gene distribution. From each individual group, data are then extracted and used to train a Support Vector Machine that makes gene function predictions. Experiments were conducted with 9 different gene functions over 296 bacteriophage genomes. Cross validation results gave an average prediction accuracy of ~80%, which is comparable to other genomic-context-based prediction methods. Functional predictions are also made on 3 uncharacterized genes and 12 genes that cannot be identified by sequence alignment. The software is publicly available at http://www.synteny.net/.

Conclusion: The proposed system employs genomic context to predict gene function and detect gene correspondence in whole-genome comparisons. Although our experimental focus is on bacteriophages, the method may be extended to other microbial genomes as they share a number of similar characteristics with phage genomes such as gene order conservation.
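As a rough illustration of the two ideas the abstract combines (grouping genomes by gene distribution, then describing each gene by its genomic context), the sketch below uses toy data. It is not the SynFPS implementation: the greedy cosine-similarity grouping, the window size, the threshold, the genome names, and the function labels are all assumptions chosen for illustration, and the abstract does not specify the clustering method or the feature encoding. The SVM training step is omitted; the context counts produced here are the kind of per-gene feature that could be passed to an SVM library.

```python
from collections import Counter
from math import sqrt


def context_features(genome, index, window=2):
    """Count the function labels of genes within `window` positions of genome[index]."""
    lo = max(0, index - window)
    hi = min(len(genome), index + window + 1)
    return Counter(genome[j] for j in range(lo, hi) if j != index)


def cosine(a, b):
    """Cosine similarity between two label-count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_genomes(genomes, threshold=0.8):
    """Greedy grouping (an assumed stand-in for the paper's clustering step):
    each genome joins the first group whose founding genome has a similar
    label distribution, otherwise it founds a new group."""
    groups = []
    for name, genes in genomes.items():
        profile = Counter(genes)
        for group in groups:
            if cosine(profile, group["profile"]) >= threshold:
                group["members"].append(name)
                break
        else:
            groups.append({"profile": profile, "members": [name]})
    return [g["members"] for g in groups]


# Hypothetical toy genomes: ordered lists of gene function labels.
genomes = {
    "phageA": ["terminase", "portal", "capsid", "tail", "holin", "lysin"],
    "phageB": ["terminase", "portal", "capsid", "tail", "lysin"],
    "phageC": ["integrase", "repressor", "helicase", "primase"],
}

groups = group_genomes(genomes)
capsid_context = context_features(genomes["phageA"], 2, window=1)
```

With these toy inputs, phageA and phageB share most labels and fall into one group, while phageC founds its own. Within a group, the context counts for each gene of known function would serve as training examples, and the same counts for an uncharacterized gene would serve as the query.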
