Prediction of complete gene structures in human genomic DNA.

We introduce a general probabilistic model of the gene structure of human genomic sequences which incorporates descriptions of the basic transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions. Distinct sets of model parameters are derived to account for the many substantial differences in gene density and structure observed in distinct C + G compositional regions of the human genome. In addition, new models of the donor and acceptor splice signals are described which capture potentially important dependencies between signal positions. The model is applied to the problem of gene identification in a computer program, GENSCAN, which identifies complete exon/intron structures of genes in genomic DNA. Novel features of the program include the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands. GENSCAN is shown to have substantially higher accuracy than existing methods when tested on standardized sets of human and vertebrate genes, with 75 to 80% of exons identified exactly. The program is also capable of indicating fairly accurately the reliability of each predicted exon. Consistently high levels of accuracy are observed for sequences of differing C + G content and for distinct groups of vertebrates.
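The abstract reports exon-level accuracy ("75 to 80% of exons identified exactly") without defining the measure. As a minimal sketch, assuming "identified exactly" means that both boundaries and the strand of a predicted exon match an annotated exon, the figure could be computed as below. The function name, the exon tuple representation and the toy coordinates are hypothetical illustrations, not taken from GENSCAN or its test sets.

```python
from typing import Iterable, Tuple

# Hypothetical exon representation: (start, end, strand).
Exon = Tuple[int, int, str]


def exact_exon_accuracy(
    predicted: Iterable[Exon], annotated: Iterable[Exon]
) -> Tuple[float, float]:
    """Return (fraction of annotated exons found exactly,
    fraction of predicted exons that are exactly correct).

    Assumption: an exon counts as exact only if start, end and
    strand all match an annotated exon.
    """
    pred = set(predicted)
    true = set(annotated)
    exact = pred & true
    found = len(exact) / len(true) if true else 0.0
    correct = len(exact) / len(pred) if pred else 0.0
    return found, correct


# Toy data (made up for illustration only).
annotated = [(100, 250, "+"), (400, 520, "+"), (800, 950, "+"), (1200, 1300, "+")]
predicted = [
    (100, 250, "+"),
    (400, 520, "+"),
    (805, 950, "+"),    # start is off by 5 bases, so not an exact match
    (1200, 1300, "+"),
    (1500, 1600, "+"),  # prediction with no annotated counterpart
]

found, correct = exact_exon_accuracy(predicted, annotated)
print(f"annotated exons found exactly: {found:.2f}")
print(f"predicted exons exactly correct: {correct:.2f}")
```

Under this reading, a prediction that overlaps an annotated exon but misses one boundary contributes nothing to either fraction, which is why exact-exon figures are a stricter measure than per-nucleotide accuracy.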
