Using a neural network to backtranslate amino acid sequences

A neural network (NN) was trained on amino acid and nucleic acid sequences to test its ability to predict a nucleic acid sequence given only an amino acid sequence. A multi-layer backpropagation network with one hidden layer of 5 to 9 neurons was used. Network configurations varied in the number of input neurons used to represent amino acids, while the output-layer representation of nucleic acids was held constant. In the best-trained network, 93% of all bases, 85% of the degenerate bases, and 100% of the fixed bases were correctly predicted for randomly selected test sequences. The training set comprised 60 human sequences, each taken as a window of 10 to 25 codons at the coding sequence start site. Different NN configurations, combining amino acid encodings with increasing window sizes, were evaluated to predict how the NN would behave with a significantly larger training set. This genetic data analysis effort will assist in understanding human gene structure. Benefits include computational tools that could more reliably backtranslate amino acid sequences, which is useful for degenerate PCR cloning, and that may assist in identifying human gene coding sequences (CDS) from open reading frames in DNA databases.
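The split between fixed and degenerate bases can be illustrated with a short Python sketch. It is not the network described above. It assumes the standard genetic code and assumes that a "fixed" base is a codon position shared by every synonymous codon of an amino acid, with all other positions counted as "degenerate" and written as N. The abstract does not define these terms, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(aa):
    """Return all codons that encode the one-letter amino acid `aa`."""
    return sorted(codon for codon, a in CODON_TABLE.items() if a == aa)


def consensus(aa):
    """Three-letter pattern: a base where all synonymous codons agree, else 'N'.

    Assumption: positions that agree are 'fixed', positions marked 'N' are
    'degenerate'. This is an illustrative definition, not the paper's.
    """
    codons = codons_for(aa)
    if not codons:
        raise ValueError(f"unknown amino acid: {aa!r}")
    return "".join(
        column[0] if len(set(column)) == 1 else "N" for column in zip(*codons)
    )


def backtranslate(protein):
    """Naive backtranslation: concatenate the consensus pattern per residue."""
    return "".join(consensus(aa) for aa in protein)


if __name__ == "__main__":
    pattern = backtranslate("MKW")
    print(pattern)                                   # ATGAANTGG
    print(pattern.count("N"), "degenerate bases")    # 1 degenerate bases
```

Under this definition, a table lookup alone recovers every fixed base, so the network's task is to choose among the alternatives at the N positions.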
