Recognizing shorter coding regions of human genes based on the statistics of stop codons.

With the rapid progress of the Human Genome Project, a large number of uncharacterized DNA sequences need to be annotated with better algorithms. Recognizing shorter coding sequences of human genes is one of the most important problems in gene recognition, and it has not yet been completely solved. This paper addresses the problem with a new method. The distributions of the three stop codons, TAA, TAG and TGA, in the three phases along coding, noncoding, and intergenic sequences are studied in detail. Using these distributions together with other coding measures, a new algorithm for recognizing shorter coding sequences of human genes is developed. The accuracy of the algorithm is tested on a large database of human genes. The average accuracy is as high as 92.1% for sequences 192 base pairs long, a result confirmed by sixfold cross-validation tests. It is hoped that combining the present method with existing algorithms will increase the accuracy of identifying human genes in unannotated sequences.
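The basic quantity behind the method, the number of stop codons observed in each of the three phases of a sequence, can be sketched as follows. This is only an illustrative counting routine, not the paper's algorithm: the function name `stop_codon_counts` and the convention that phase 0 starts at the first base are assumptions made here, and the paper's other coding measures and its classifier are not reproduced.

```python
from collections import Counter

# The three stop codons studied in the abstract.
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_counts(seq):
    """Count stop codons in each of the three phases of one strand.

    Hypothetical helper: phase p reads non-overlapping triplets
    starting at offset p (p = 0, 1, 2) from the first base.
    Returns a dict mapping phase -> Counter of stop codons.
    """
    seq = seq.upper()
    counts = {}
    for phase in range(3):
        # Step through the sequence three bases at a time,
        # stopping before any incomplete trailing triplet.
        codons = (seq[i:i + 3] for i in range(phase, len(seq) - 2, 3))
        counts[phase] = Counter(c for c in codons if c in STOP_CODONS)
    return counts


if __name__ == "__main__":
    example = "ATGTAAGGCTGA"  # made-up toy sequence, not real data
    for phase, counter in stop_codon_counts(example).items():
        print(phase, dict(counter))
    # 0 {'TAA': 1, 'TGA': 1}
    # 1 {}
    # 2 {}
```

In a true coding sequence read in its correct phase, stop codons are absent except at the end, whereas the other two phases and non-coding sequences carry no such constraint; per-phase counts of this kind are what makes the stop-codon distributions informative.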
