Neural Network Prediction of Translation Initiation Sites in Eukaryotes: Perspectives for EST and Genome Analysis

Translation in eukaryotes does not always start at the first AUG in an mRNA, implying that context information also plays a role. This makes prediction of translation initiation sites a non-trivial task, especially when analysing EST and genome data where the entire mature mRNA sequence is not known. In this paper, we employ artificial neural networks to predict which AUG triplet in an mRNA sequence is the start codon. The trained networks correctly classified 88% of Arabidopsis and 85% of vertebrate AUG triplets. We find that our trained neural networks use a combination of local start codon context and global sequence information. Furthermore, analysis of false predictions shows that AUGs in frame with the actual start codon are more frequently selected than out-of-frame AUGs, suggesting that our networks use reading frame detection. A number of conflicts between neural network predictions and database annotations are analysed in detail, leading to identification of possible database errors.
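As an illustration of the classification task described above, the following minimal sketch enumerates every AUG triplet in a sequence, cuts out a fixed-width context window around it, and records whether it lies in frame with an annotated start codon. It is not the authors' implementation: the function name `find_aug_candidates`, the `flank` width, and the `N` padding are assumptions made for illustration. A classifier such as a neural network would take these windows as input.

```python
from typing import List, NamedTuple, Optional


class AugCandidate(NamedTuple):
    position: int             # 0-based index of the A in the AUG triplet
    window: str               # flanking context with the AUG in the centre
    in_frame: Optional[bool]  # None when no annotated start is supplied


def find_aug_candidates(
    mrna: str,
    flank: int = 10,          # hypothetical window half-width, not from the paper
    annotated_start: Optional[int] = None,
) -> List[AugCandidate]:
    """List every AUG triplet with its sequence context and frame status."""
    # Accept DNA or RNA input; work in the RNA alphabet.
    seq = mrna.upper().replace("T", "U")
    # Pad both ends with 'N' so windows near the ends keep a fixed width.
    padded = "N" * flank + seq + "N" * flank

    candidates: List[AugCandidate] = []
    pos = seq.find("AUG")
    while pos != -1:
        # Index pos in seq corresponds to pos + flank in padded, so the
        # window starting at padded[pos] is centred on the AUG.
        window = padded[pos : pos + 2 * flank + 3]
        in_frame = (
            None
            if annotated_start is None
            else (pos - annotated_start) % 3 == 0
        )
        candidates.append(AugCandidate(pos, window, in_frame))
        pos = seq.find("AUG", pos + 1)
    return candidates


if __name__ == "__main__":
    for cand in find_aug_candidates("CCAUGGCAAUGCAUGA", flank=3, annotated_start=2):
        print(cand)
```

The `in_frame` flag mirrors the distinction drawn in the analysis of false predictions, where AUGs in frame with the actual start codon are selected more often than out-of-frame ones.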
