Improving gene annotation of complete viral genomes

Gene annotation in viruses often relies on similarity search methods. These methods have high specificity, but they can miss genes that are unique to a particular genome or highly divergent from known homologs. To identify potentially missing viral genes, we analyzed all complete viral genomes currently available in GenBank with a specialized and augmented version of the gene-finding program GeneMarkS. In particular, by implementing genome-specific self-training protocols, we better adjusted the GeneMarkS statistical models to the sequences of viral genomes. Hundreds of new genes were identified, some in well-studied viral genomes. For example, a new gene predicted in the genome of the Epstein–Barr virus was shown to encode a protein similar to α-herpesvirus minor tegument protein UL14, which has heat shock functions. Convincing evidence of this similarity was obtained after only two PSI-BLAST iterations. In another example, several iterations of PSI-BLAST were required to demonstrate that a gene predicted in the genome of Alcelaphine herpesvirus 1 encodes a BALF1-like protein, which is thought to be involved in apoptosis regulation and, potentially, carcinogenesis. The new predictions were used to refine annotations of viral genomes in the RefSeq collection curated by the National Center for Biotechnology Information. Importantly, even where no sequence similarities were detected, GeneMarkS significantly reduced the number of primary targets for experimental characterization by identifying the most probable candidate genes. The new genome annotations were stored in VIOLIN, an interactive database that provides access to similarity search tools for up-to-date analysis of predicted viral proteins.
