10 – Prediction of Human Gene Structure

Gene identification is an important problem in current molecular biology. With recent progress in large-scale sequencing projects, gene identification programs have become widely used. These programs can significantly simplify the analysis of newly sequenced DNA, especially when applied in combination with experimental methods. Gene identification is nonetheless very complex because of the structure of eukaryotic genes. The analysis of human genes cannot be treated merely as a linguistic analysis of the nucleotide string, because gene structure also involves many other important features: higher-order chromatin structure, nonrandom nucleosome positioning along the DNA, features of the three-dimensional structure of the DNA (or RNA), and the torsional strain on the DNA induced by transcription. This chapter describes the most important aspects of gene structure prediction: functional sites in nucleotide sequences, functional regions in nucleotide sequences, protein-coding gene structure prediction, analysis of potential proteins coded by predicted genes, and RNA-coding gene structure prediction.
