Protein-coding regions prediction combining similarity searches and conservative evolutionary properties of protein-coding sequences.

The gene identification procedure in a completely new gene with no good homology with protein sequences can be a very complex task. In order to identify the protein-coding region, a new method, 'SYNCOD', based on the analysis of conservative evolutionary properties of coding regions, has been realized. This program is able to identify and use the coding region homologies of the non-annotated (unknown) protein-coding sequences already present in the nucleotide sequence databases by using the alignment produced by BLASTN. The ratio of number mismatches resulting in synonymous codons to the number of mismatches resulting in non-synonymous codons is estimated for each open reading frame. Monte Carlo simulations are then used to estimate the significance of the ratio deviation from random behavior. The SYNCOD program has been tested on generated random sequences and on different control sets. The high accuracy of predicting protein-coding regions (the correlation coefficient, CC, varies from 0.67 to 0.79) and the high specificity (the portion of wrong exons, WE, varies from 0.06 to 0.07) have proved to be important features of the suggested approach. The SYNCOD program is resident on the ITBA-CNR Web Server and can be used via the Internet (URL: www.itba.mi.cnr.it/webgene).

[1]  E. Snyder,et al.  Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. , 1993, Nucleic acids research.

[2]  David J. States,et al.  QGB: Combined Use of Sequence Similarity and Codon Bias for Coding Region Identification , 1994, J. Comput. Biol..

[3]  Mark Borodovsky,et al.  GENMARK: Parallel Gene Recognition for Both DNA Strands , 1993, Comput. Chem..

[4]  Chris A. Fields,et al.  gm: a practical tool for automating DNA sequence analysis , 1990, Comput. Appl. Biosci..

[5]  Martin J. Bishop,et al.  Guide to Human Genome Computing , 1994 .

[6]  David J. States,et al.  Identification of protein coding regions by database similarity search , 1993, Nature Genetics.

[7]  M S Gelfand,et al.  Computer prediction of the exon-intron structure of mammalian pre-mRNAs. , 1990, Nucleic acids research.

[8]  Luciano Milanesi,et al.  10 – Prediction of Human Gene Structure , 1998 .

[9]  Jerzy Jurka,et al.  Censor - a Program for Identification and Elimination of Repetitive Elements From DNA Sequences , 1996, Computers and Chemistry.

[10]  Mikhail S. Gelfand,et al.  Prediction of Function in DNA Sequence , 1995, J. Comput. Biol..

[11]  V. Solovyev,et al.  Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. , 1994, Nucleic acids research.

[12]  Steven Salzberg,et al.  Finding Genes in DNA with a Hidden Markov Model , 1997, J. Comput. Biol..

[13]  Luciano Milanesi,et al.  Analysis of donor splice sites in different eukaryotic organisms , 1997, Journal of Molecular Evolution.

[14]  Michael R. Hayden,et al.  The prediction of exons through an analysis of spliceable open reading frames , 1992, Nucleic Acids Res..

[15]  M H Skolnick,et al.  A probabilistic model for detecting coding regions in DNA sequences. , 1994, IMA journal of mathematics applied in medicine and biology.

[16]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[17]  N L Harris,et al.  Splice junctions, branch point sites, and exons: sequence statistics, identification, and applications to genome project. , 1990, Methods in enzymology.

[18]  E V Koonin,et al.  New genes in old sequence: a strategy for finding genes in the bacterial genome. , 1994, Trends in biochemical sciences.

[19]  R. Guigó,et al.  Evaluation of gene structure prediction programs. , 1996, Genomics.

[20]  N. Kolchanov,et al.  Somatic hypermutagenesis in immunoglobulin genes. I. Correlation between somatic mutations and repeats. Somatic mutation properties and clonal selection. , 1991, Biochimica et biophysica acta.

[21]  Temple F. Smith,et al.  Prediction of gene structure. , 1992, Journal of molecular biology.

[22]  Luciano Milanesi,et al.  Gene structure prediction using information on homologous protein sequence , 1996, Comput. Appl. Biosci..

[23]  D. Searls,et al.  Gene structure prediction by linguistic methods. , 1994, Genomics.

[24]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[25]  J W Fickett,et al.  Finding genes by computer: the state of the art. , 1996, Trends in genetics : TIG.

[26]  E. Uberbacher,et al.  Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. , 1991, Proceedings of the National Academy of Sciences of the United States of America.

[27]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[28]  Michael Ruogu Zhang,et al.  Identification of protein coding regions in the human genome by quadratic discriminant analysis. , 1997, Proceedings of the National Academy of Sciences of the United States of America.

[29]  G. Bernardi,et al.  The isochore organization of the human genome. , 1989, Annual review of genetics.

[30]  Wen-Hsiung Li,et al.  Fundamentals of molecular evolution , 1990 .

[31]  Alexander E. Kel,et al.  GenViewer: A computing tool for protein-coding regions prediction in nucleotide sequences , 1993 .

[32]  P. Pevzner,et al.  Gene recognition via spliced sequence alignment. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[33]  B. Dujon The yeast genome project: what did we learn? , 1996, Trends in genetics : TIG.

[34]  David Haussler,et al.  A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA , 1996, ISMB.