GeneBuilder: interactive in silico prediction of gene structure

MOTIVATION: Prediction of gene structure in newly sequenced DNA has become very important in large genome sequencing projects. The problem is complicated by the exon-intron structure of eukaryotic genes and by the fact that gene expression is regulated by many different short nucleotide domains. To analyse the full gene structure in different organisms, it is necessary to combine information about potential functional signals (promoter region, splice sites, start and stop codons, 3' untranslated region) with the statistical properties of coding sequences (coding potential) and with information about homologous proteins, ESTs and repeated elements.

RESULTS: We have developed the GeneBuilder system, which is based on prediction of functional signals and coding regions by different approaches, combined with similarity searches in protein and EST databases. Potential gene structure models are obtained by a dynamic programming method. The program permits the use of several parameters for gene structure prediction and refinement. During gene model construction, the accuracy of the prediction can be improved by selecting different exon homology levels with a protein sequence chosen from a list of homologous proteins. In the case of low homology, GeneBuilder is still able to predict the gene structure. The GeneBuilder system has been tested on the standard set (Burset and Guigo, Genomics, 34, 353-367, 1996), achieving a sensitivity of 0.89 and a specificity of 0.91 at the nucleotide level, with a total correlation coefficient of 0.88.

AVAILABILITY: The GeneBuilder system is implemented as a part of WebGene at the URL http://www.itba.mi.cnr.it/webgene and of the TRADAT (TRAnscription Database and Analysis Tools) launcher at the URL http://www.itba.mi.cnr.it/tradat.
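The nucleotide-level figures quoted in the results can be illustrated with a minimal sketch. It assumes the definitions commonly used with the Burset and Guigo benchmark: sensitivity is TP/(TP+FN), "specificity" is the fraction of predicted coding nucleotides that are truly coding, TP/(TP+FP), and the correlation coefficient is the usual two-class correlation. The function name `nucleotide_accuracy` and the toy 0/1 coding masks are hypothetical and are not part of GeneBuilder.

```python
import math


def nucleotide_accuracy(actual, predicted):
    """Return (sensitivity, specificity, correlation) for two 0/1 coding masks.

    Each mask has one entry per nucleotide: 1 = coding, 0 = non-coding.
    "Specificity" follows the gene-finding convention TP / (TP + FP).
    """
    if len(actual) != len(predicted):
        raise ValueError("masks must have the same length")

    # Count the four nucleotide-level outcomes.
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)

    nan = float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    specificity = tp / (tp + fp) if (tp + fp) else nan

    # Correlation coefficient; undefined when any marginal total is zero.
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    correlation = (tp * tn - fn * fp) / denom if denom else nan
    return sensitivity, specificity, correlation


# Toy example (made-up masks, not data from the paper).
actual = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
sn, sp, cc = nucleotide_accuracy(actual, predicted)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

In this toy case three coding nucleotides are recovered, one is missed and one is falsely predicted, so sensitivity and specificity are both 0.75. The correlation coefficient also rewards the correctly predicted non-coding positions, which is why it is reported alongside the other two measures.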
