Hamming-Clustering method for signals prediction in 5' and 3' regions of eukaryotic genes

MOTIVATION: Gene expression is regulated by several kinds of short nucleotide domains, which can either activate or terminate transcription. To predict signal sites in the 5' and 3' gene regions, we applied the Hamming-Clustering (HC) network to the determination of the TATA box, the transcription initiation site and the poly(A) signal in DNA sequences. This approach uses a technique derived from the synthesis of digital networks to generate prototypes, or rules, which can be analysed directly or used to construct a final neural network.

RESULTS: More than 1000 poly(A) signals were extracted from the EMBL database (rel. 42) and used to build the training and test sets. A full set of eukaryotic genes (1252 entries) from the Eukaryotic Promoter Database (EPD rel. 42) was used for the TATA-box signal and transcription network approach. The results show the applicability of the Hamming-Clustering method to functional signal prediction.
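To make the idea of a prototype, or rule, over binary-coded nucleotides more concrete, the following minimal Python sketch may help. It is not the HC algorithm itself and is not taken from the paper: the 2-bit encoding, the example rule and all function names are assumptions chosen for illustration. It only shows how a rule with "don't care" positions, of the kind used in digital network synthesis, can be matched against a DNA window, and how the Hamming distance between two coded windows is computed.

```python
# Hypothetical 2-bit nucleotide encoding (an assumption; the coding
# actually used with HC is not stated in the abstract).
ENCODING = {"A": "00", "T": "01", "C": "10", "G": "11"}


def encode(seq):
    """Translate a DNA string into a binary string, 2 bits per base."""
    return "".join(ENCODING[base] for base in seq.upper())


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def matches(rule, bits):
    """True if `bits` satisfies `rule`, where '-' in the rule is a don't-care bit."""
    return len(rule) == len(bits) and all(
        r == "-" or r == b for r, b in zip(rule, bits)
    )


def scan(seq, rule):
    """Return the start positions of all windows in `seq` that satisfy `rule`."""
    width = len(rule) // 2  # 2 bits per nucleotide
    return [
        i
        for i in range(len(seq) - width + 1)
        if matches(rule, encode(seq[i:i + width]))
    ]


# Illustrative rule for T-A-T-A followed by A or T: under this encoding
# A = "00" and T = "01", so "A or T" collapses to "0-".
TATA_LIKE_RULE = "01000100" + "0-"

hits = scan("GGCTATAAAGGC", TATA_LIKE_RULE)
distance = hamming(encode("TATA"), encode("TAAA"))
```

In this toy example a single rule covers two related sequences (TATAA and TATAT) because one bit is left unspecified, which is the sense in which a rule can act as a prototype for a cluster of similar binary patterns.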
