Computational Gene Finding

Computational methodology for finding genes and other functional sites in genomic DNA has evolved significantly over the last 20 years. Excellent recent surveys have been given by Gelfand [27], Fickett [20, 21], Guigó [31], Claverie [13], Milanesi and Rogosin [50], and Krogh [41]. Extensive bibliographies are available at http://linkage.rockefeller.edu/wli/gene/ and http://www-hto.usc.edu/software/procrustes/fans_ref/. Here we give only a very brief overview. Among the types of functional sites in genomic DNA that researchers have sought to recognize are splice sites, start and stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosomal binding sites, topoisomerase II binding sites, topoisomerase I cleavage sites, and various transcription factor binding sites [27]. Local sites such as these are called signals, and methods for detecting them may be called signal sensors. Genomic DNA signals can be contrasted with extended, variable-length regions such as exons and introns, which are recognized by different methods that may be called content sensors [64, 65].
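The distinction between signal sensors and content sensors can be illustrated with a minimal sketch. It is not taken from any of the surveyed programs. It assumes a log-odds position weight matrix as the signal sensor and a codon-frequency log-likelihood ratio as the content sensor. The function names, the toy donor-like sites, and the codon frequencies are all hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"

def pwm_from_sites(sites, pseudocount=1.0):
    """Build a log-odds weight matrix from aligned, equal-length example sites.

    Assumption: a uniform background of 0.25 per base.
    """
    length = len(sites[0])
    pwm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in BASES})
    return pwm

def signal_scores(seq, pwm):
    """Signal sensor: score every fixed-width window of seq against the matrix."""
    width = len(pwm)
    return [
        sum(pwm[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]

def content_score(region, coding_freq, background_freq=1.0 / 64):
    """Content sensor: codon log-likelihood ratio summed over a region of any length.

    Codons missing from coding_freq get a small floor frequency (an assumption).
    """
    score = 0.0
    for i in range(0, len(region) - 2, 3):
        codon = region[i:i + 3]
        score += math.log2(coding_freq.get(codon, 1e-3) / background_freq)
    return score

# Hypothetical toy data: four donor-like sites and two "preferred" codons.
toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
toy_pwm = pwm_from_sites(toy_sites)
toy_scores = signal_scores("CCAGGTAAGTCC", toy_pwm)
best_position = toy_scores.index(max(toy_scores))

toy_coding_freq = {"GCT": 0.05, "GAA": 0.04}
coding_like = content_score("GCTGAAGCT", toy_coding_freq)
noncoding_like = content_score("TTTTTT", toy_coding_freq)
```

The signal sensor looks at a fixed-width window and reports a score per position, while the content sensor accumulates evidence over a region whose length is not fixed in advance. Real systems use far richer models for both.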

[1]  H. Prydz,et al.  CpG islands as gene markers in the human genome. , 1992, Genomics.

[2]  M. Borodovsky,et al.  GeneMark.hmm: new solutions for gene finding. , 1998, Nucleic acids research.

[3]  M. O'Neill,et al.  Training back-propagation neural networks to define and detect DNA-binding sites. , 1991, Nucleic acids research.

[4]  A. Lapedes,et al.  Application of neural networks and other machine learning algorithms to DNA sequence analysis , 1988 .

[5]  G. Stormo Computer methods for analyzing sequence recognition of nucleic acids. , 1988, Annual Review of Biophysics and Biophysical Chemistry.

[6]  A. Krogh Two methods for improving performance of an HMM application for gene finding , 1997 .

[7]  Michael C. O'Neill,et al.  Escherichia coli promoters: neural networks develop distinct descriptions in learning to search for promoters of different spacing classes , 1992, Nucleic Acids Res..

[8]  David B. Searls,et al.  The Linguistics of DNA , 1992 .

[9]  M H Skolnick,et al.  A probabilistic model for detecting coding regions in DNA sequences. , 1994, IMA journal of mathematics applied in medicine and biology.

[10]  T J Gibson,et al.  PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. , 1996, Nucleic acids research.

[11]  David B. Searls,et al.  The computational linguistics of biological sequences , 1993, ISMB 1995.

[12]  J W Fickett,et al.  Finding genes by computer: the state of the art. , 1996, Trends in genetics : TIG.

[13]  M. Adams,et al.  A tool for analyzing and annotating genomic sequences. , 1997, Genomics.

[14]  Ying Xu,et al.  Inferring Gene Structures in Genomic Sequences Using Pattern Recognition and Expressed Sequence Tags , 1997, ISMB.

[15]  Thomas D. Wu  A Segment-based Dynamic Programming Algorithm for Parsing Gene Structure , 1996 .

[16]  M S Gelfand,et al.  Prediction of function in DNA sequence analysis. , 1995, Journal of computational biology : a journal of computational molecular cell biology.

[17]  E. Uberbacher,et al.  Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. , 1991, Proceedings of the National Academy of Sciences of the United States of America.

[18]  J. Fickett,et al.  Assessment of protein coding measures. , 1992, Nucleic acids research.

[19]  Kenneth H. Fasman,et al.  Finding Genes in Human DNA with a Hidden Markov Model , 1996, ISMB 1996.

[20]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[21]  Mark Borodovsky,et al.  GENMARK: Parallel Gene Recognition for Both DNA Strands , 1993, Comput. Chem..

[22]  Ewan Birney,et al.  Dynamite: A Flexible Code Generating Language for Dynamic Programming Methods Used in Sequence Comparison , 1997, ISMB.

[23]  Jean-Michel Claverie,et al.  Sequence "Signals": Artifact or Reality? , 1992, Comput. Chem..

[24]  Luciano Milanesi,et al.  10 – Prediction of Human Gene Structure , 1998 .

[25]  R Staden Computer methods to locate signals in nucleic acid sequences , 1984, Nucleic Acids Res..

[26]  M. Gelfand,et al.  Prediction of the exon-intron structure by a dynamic programming approach. , 1993, Bio Systems.

[27]  Chris A. Fields,et al.  gm: a practical tool for automating DNA sequence analysis , 1990, Comput. Appl. Biosci..

[28]  Jean-Michel Claverie,et al.  Some Useful Statistical Properties of Position-weight Matrices , 1994, Comput. Chem..

[29]  S. Salzberg,et al.  Microbial gene identification using interpolated Markov models. , 1998, Nucleic acids research.

[30]  D. Haussler,et al.  A hidden Markov model that finds genes in E. coli DNA. , 1994, Nucleic acids research.

[31]  Mark Craven,et al.  Learning to predict reading frames in E. coli DNA sequences , 1993, [1993] Proceedings of the Twenty-sixth Hawaii International Conference on System Sciences.

[32]  A. Lapedes,et al.  Determination of eukaryotic protein coding regions using neural networks and information theory. , 1992, Journal of Molecular Biology.

[33]  Michael Q. Zhang,et al.  A weight array method for splicing signal analysis , 1993, Comput. Appl. Biosci..

[34]  J W Fickett,et al.  Distinctive sequence features in protein coding, genic non-coding, and intergenic human DNA. , 1995, Journal of molecular biology.

[35]  V. Solovyev,et al.  Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames , 1994 .

[36]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[37]  A. Bairoch PROSITE: a dictionary of sites and patterns in proteins. , 1991, Nucleic acids research.

[38]  P. V. von Hippel,et al.  Selection of DNA binding sites by regulatory proteins. , 1988, Trends in biochemical sciences.

[39]  A. Krogh 11 – Gene Finding: Putting the Parts Together , 1998 .

[40]  Jerzy Jurka,et al.  Censor - a Program for Identification and Elimination of Repetitive Elements From DNA Sequences , 1996, Computers and Chemistry.

[41]  Douglas L. Brutlag,et al.  Detection of Correlations in tRNA Sequences with Structural Implications , 1993, ISMB.

[42]  P. Pevzner,et al.  Gene recognition via spliced sequence alignment. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[43]  Temple F. Smith,et al.  Comparison of biosequences , 1981 .

[44]  A. Zahler,et al.  Specific binding of an exonic splicing enhancer by the pre-mRNA splicing factor SRp55. , 1998, RNA.

[45]  R. Staden Finding protein coding regions in genomic sequences. , 1990, Methods in enzymology.

[46]  Mikhail S. Gelfand,et al.  Recognition of Genes in Human DNA Sequences , 1996, J. Comput. Biol..

[47]  B. Barrell,et al.  Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence , 1998, Nature.

[48]  Ying Xu,et al.  An Improved System for Exon Recognition and Gene Modeling in Human DNA Sequence , 1994, ISMB.

[49]  S. Knudsen,et al.  Prediction of human mRNA donor and acceptor sites from the DNA sequence. , 1991, Journal of molecular biology.

[50]  Ying Xu,et al.  Detection of RNA Polymerase II Promoters and Polyadenylation Sites in Human DNA Sequence , 1996, Comput. Chem..

[51]  M S Gelfand,et al.  Computer prediction of the exon-intron structure of mammalian pre-mRNAs. , 1990, Nucleic acids research.

[52]  M. G. Reese,et al.  NOVEL NEURAL NETWORK PREDICTION SYSTEMS FOR HUMAN PROMOTERS AND SPLICE SITES , 1995 .

[53]  Steven Salzberg,et al.  Locating Protein Coding Regions in Human DNA Using a Decision Tree Algorithm , 1995, J. Comput. Biol..

[54]  R. Guigó,et al.  Evaluation of gene structure prediction programs. , 1996, Genomics.

[55]  D Haussler,et al.  Integrating database homology in a probabilistic gene structure model. , 1997, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[56]  T. Heinemeyer,et al.  GenExpress: A Computer System for Description, Analysis and Recognition of Regulatory Sequences in Eukaryotic Genome , 1998, ISMB.

[57]  James W. Fickett,et al.  The Gene Identification Problem: An Overview for Developers , 1995, Comput. Chem..

[58]  M. Frommer,et al.  CpG islands in vertebrate genomes. , 1987, Journal of molecular biology.

[59]  D. Searls,et al.  Gene structure prediction by linguistic methods. , 1994, Genomics.

[60]  Edward C. Uberbacher,et al.  Automated Gene Identification in Large-Scale Genomic Sequences , 1997, J. Comput. Biol..

[61]  David Haussler,et al.  A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA , 1996, ISMB.

[62]  J. Claverie Computational methods for the identification of genes in vertebrate genomic sequences. , 1997, Human molecular genetics.

[63]  A. Bird CpG islands as gene markers in the vertebrate nucleus , 1987 .

[64]  R. Staden,et al.  The C. elegans genome sequencing project: a beginning , 1992, Nature.

[65]  J. Fickett,et al.  Eukaryotic promoter recognition. , 1997, Genome research.

[66]  R. Guigó,et al.  Computational gene identification , 1997, Journal of Molecular Medicine.

[67]  G. Stormo Consensus patterns in DNA. , 1990, Methods in enzymology.

[68]  Michael Ruogu Zhang,et al.  Statistical features of human exons and their flanking regions. , 1998, Human molecular genetics.

[69]  Pankaj Agarwal,et al.  The Ribosome Scanning Model for Translation Initiation: Implications for Gene Prediction and Full-Length cDNA Detection , 1998, ISMB.

[70]  David Haussler,et al.  Improved splice site detection in Genie , 1997, RECOMB '97.

[71]  E. Snyder,et al.  Identification of protein coding regions in genomic DNA. , 1995, Journal of molecular biology.