Classification Neural Networks for Rapid Sequence Annotation and Automated Database Organization

Abstract A neural network classification method has been developed as an alternative approach to the search/organization problem of large molecular databases. Two artificial neural systems have been implemented on a Cray for rapid protein/nucleic acid classification of unknown sequences. The system employs a n -gram hashing function for sequence encoding and modular back-propagation networks for classification. The protein system, which classifies proteins into PIR (Protein Identification Resource) superfamilies, has achieved 82–100% sensitivity at a speed that is about an order of magnitude faster than other search methods. The pilot nucleic acid system showed a 91–97% classification accuracy. The software tool could be used as a filter program to reduce the database search time and help organize the molecular sequence databases. The tool is generally applicable to any databases that are organized according to family relationships.

[1]  A. Bairoch PROSITE: a dictionary of sites and patterns in proteins. , 1991, Nucleic acids research.

[2]  Benny Lautrup,et al.  A novel approach to prediction of the 3‐dimensional structures of protein backbones by neural networks , 1990, NIPS.

[3]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[4]  R Langridge,et al.  Improvements in protein secondary structure prediction by an enhanced neural network. , 1990, Journal of molecular biology.

[5]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[6]  T. D. Schneider,et al.  Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli. , 1982, Nucleic acids research.

[7]  S. Henikoff,et al.  Automated assembly of protein blocks for database searching. , 1991, Nucleic acids research.

[8]  Michael C. O'Neill,et al.  Escherichia coli promoters: neural networks develop distinct descriptions in learning to search for promoters of different spacing classes , 1992, Nucleic Acids Res..

[9]  James L. McClelland,et al.  Parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations , 1986 .

[10]  Cathy H. Wu,et al.  Protein classification artificial neural system , 1992, Protein science : a publication of the Protein Society.

[11]  Marin van Heel,et al.  A new family of powerful multivariate statistical sequence analysis techniques. , 1991 .

[12]  Simon A. Barton,et al.  A Matrix Method for Optimizing a Neural Network , 1991, Neural Computation.

[13]  T. Sejnowski,et al.  Predicting the secondary structure of globular proteins using neural network models. , 1988, Journal of molecular biology.

[14]  Carl O. Pabo,et al.  New generation databases for molecular biology , 1987, Nature.

[15]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[16]  G. Zhou,et al.  Neural network optimization for E. coli promoter prediction. , 1991, Nucleic acids research.