Self‐organized neural maps of human protein sequences

We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm, using as inputs the matrix patterns derived from the dipeptide composition of the proteins. We present here a large‐scale application of that method to classify the 1,758 human protein sequences longer than 50 amino acids stored in the SwissProt database (release 19.0). In the final 2‐dimensional topologically ordered map of 15 × 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. In an attempt to shorten the time‐consuming learning procedure, we also compared 2 learning protocols: one of 500 epochs (100 SUN CPU‐hours [CPU‐h]) and another of 30 epochs (6.7 CPU‐h). A further reduction of the learning time, by a factor of about 3.3 and with similar protein clustering results, was achieved by using a matrix of 11 × 11 components to represent the sequences. Although network training is time consuming, classifying a new protein in the final ordered map is very fast (14.6 CPU‐seconds). We also compare the artificial neural network approach with conventional methods of biosequence analysis.
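
The two ingredients named above, a dipeptide-composition pattern per protein and a Kohonen map trained on those patterns, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the unit-length normalization, the linear decay of the learning rate and neighborhood radius, and the Gaussian neighborhood are all assumptions chosen for brevity, and the 20 × 20 pattern is flattened into a 400-component vector.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def dipeptide_vector(seq):
    """Flattened 20 x 20 dipeptide counts of seq, scaled to unit length (assumed)."""
    counts = [0.0] * 400
    for a, b in zip(seq, seq[1:]):
        if a in INDEX and b in INDEX:  # skip non-standard residue codes
            counts[INDEX[a] * 20 + INDEX[b]] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


def best_neuron(weights, x):
    """Grid position (row, col) of the neuron whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda rc: sum((w - v) ** 2 for w, v in zip(weights[rc], x)),
    )


def train_som(vectors, side=15, epochs=30, seed=0):
    """Train a side x side Kohonen map; the decay schedules are assumptions."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(side)
        for c in range(side)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                       # shrinking learning rate
        radius = max(1.0, (side / 2.0) * (1.0 - frac))  # shrinking neighborhood
        for x in vectors:
            br, bc = best_neuron(weights, x)
            for (r, c), w in weights.items():
                d2 = (r - br) ** 2 + (c - bc) ** 2
                if d2 <= radius * radius:
                    # Pull neurons near the winner toward the input pattern.
                    h = rate * math.exp(-d2 / (2.0 * radius * radius))
                    for i in range(dim):
                        w[i] += h * (x[i] - w[i])
    return weights
```

Once the map is trained, placing a new protein takes a single `best_neuron(weights, dipeptide_vector(seq))` call, with no further learning. This is consistent with the abstract's observation that classification is fast even though training is slow.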
