Computational Gene Identification: Under the Hood

Publisher Summary: This chapter reviews, for the scientific layman, the computational techniques for identifying genes in DNA sequences, and describes the working principles, capabilities, and limitations of gene identification software. Some attention is also given to likely future developments. The emphasis is on eukaryotes, where the problem is both most interesting and most difficult. Two types of computational analysis are normally performed on essentially every newly determined DNA sequence. The first is a database search that compares the new sequence with existing collections (of nucleotide sequences, amino acid sequences, or motifs). The second, the topic of this chapter, is a search for protein-coding regions, or genes. The chapter describes the three primary means of gathering clues about the existence, location, and function of genes: database similarity search, statistical regularities of coding regions, and pattern recognition of functional sites. The purpose of this review is to give an overview of these techniques for readers who would like to understand, at a high level, how computational gene identification is done.
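As a concrete illustration of the simplest kind of search for protein-coding regions, the sketch below scans the three forward reading frames of a DNA string for open reading frames (an ATG followed in-frame by a stop codon). It is a minimal sketch and is not taken from the chapter: the function name `find_orfs` and the `min_codons` threshold are hypothetical, only the forward strand is scanned, and the standard stop codons TAA, TAG, and TGA are assumed. In eukaryotes, where introns interrupt coding sequence, a scan like this is at most a first clue and would need to be combined with the other kinds of evidence the summary lists.

```python
# Assumed: the three standard stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the 0-based index of the ATG, end is the index just past the
    stop codon, and frame is 0, 1, or 2. min_codons (a hypothetical
    threshold) is the minimum number of codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG in the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs


if __name__ == "__main__":
    # Toy sequence: ATG AAA TTT GGG TAA embedded in frame 2.
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A real gene finder would also scan the reverse complement and score candidate regions statistically rather than accept every ORF above a length cutoff.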
