Gene structure prediction using information on homologous protein sequence

This paper describes a new approach to predicting the structure of protein-coding genes. The prediction proceeds in three steps. First, the exons with the best potential are predicted in a sequence of unknown function, and a list of the potential amino acid fragments encoded by these exons is formed. Second, each amino acid fragment in the list is tested for homology with proteins from the SWISS-PROT database of amino acid sequences, and the single protein with the best homology is chosen from among all the homologous sequences. Third, the exon-intron structure is reconstructed on the basis of its homology with the chosen protein sequence. The method was tested on an independent control set of 20 genes: 21% of the real exons were missed, and false (non-real) exons were predicted at a rate of 3%. This system can be used to refine the results of gene prediction systems, especially when highly homologous proteins are found in the amino acid sequence database.
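The three steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the candidate exons of step 1 are assumed to be supplied already (as hypothetical `(start, end, dna)` tuples in a known reading frame), homology is approximated by counting shared amino acid 3-mers rather than by a real alignment, and the names `translate`, `shared_kmers`, `best_protein`, `reconstruct` and the threshold `min_shared` are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Standard genetic code, built in the conventional TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO))

Exon = Tuple[int, int, str]  # (start, end, DNA sequence); hypothetical format


def translate(dna: str) -> str:
    """Translate DNA in frame 0, ignoring a trailing incomplete codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def shared_kmers(a: str, b: str, k: int = 3) -> int:
    """Crude similarity stand-in: number of distinct k-mers common to a and b."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def best_protein(fragments: List[str], database: Dict[str, str]) -> str:
    """Step 2: pick the database protein most similar to the fragment list."""
    def total(name: str) -> int:
        return sum(shared_kmers(f, database[name]) for f in fragments)
    return max(database, key=total)


def reconstruct(candidates: List[Exon], protein: str,
                min_shared: int = 2) -> List[Tuple[int, int]]:
    """Step 3: keep candidate exons supported by the chosen protein."""
    kept = [(start, end) for start, end, dna in candidates
            if shared_kmers(translate(dna), protein) >= min_shared]
    return sorted(kept)


# Toy data (invented for illustration only).
candidates: List[Exon] = [
    (10, 33, "ATGAAAACCGCGTATATTGCGAAA"),    # encodes MKTAYIAK
    (60, 83, "GGTGGCGGACCTCCCCCACTTCTG"),    # encodes GGGPPPLL (spurious)
    (120, 143, "CAGATTAGCTTTGTGAAAAGCCAT"),  # encodes QISFVKSH
]
database = {
    "P_TARGET": "MKTAYIAKQRQISFVKSHFSRQ",
    "P_OTHER": "MSDNELLWWAGHHCC",
}

fragments = [translate(dna) for _, _, dna in candidates]  # list from step 1
chosen = best_protein(fragments, database)                # step 2
structure = reconstruct(candidates, database[chosen])     # step 3
print(chosen, structure)
```

In this toy run the spurious candidate shares no 3-mers with the chosen protein and is dropped, which mirrors how homology with one selected protein can filter false exons. A realistic version would replace `shared_kmers` with a proper alignment score and would also have to handle reading frames and exon boundaries.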
