Computational methods for exon detection

Computer methods for the complete and accurate detection of genes in vertebrate genomic sequences are still far from perfect. The intermediate task of identifying the coding moiety of genes (coding exons) is now reasonably well achieved using a combination of methods. After reviewing the intrinsic difficulties in interpreting vertebrate genomic sequences, this article presents the state of the art, with an emphasis on similarity search methods and the resources available through the Internet.
