Predicting Protein Folding Classes without Overly Relying on Homology

An important open problem in molecular biology is how to use computational methods to understand the structure and function of proteins given only their primary sequences. We describe and evaluate an original machine-learning approach to classifying protein sequences according to their structural folding class. Our work is novel in two respects: we use a set of protein classes that have not previously been used for classifying primary sequences, and we use a unique set of attributes to represent protein sequences to the learners. We evaluate our approach by measuring its ability to correctly classify proteins that were not in its training set. We compare our input representation to a commonly used one, amino acid composition, and show that our approach more accurately classifies proteins that have very limited homology to the sequences on which the systems are trained.
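For readers unfamiliar with the baseline representation, the following is a minimal sketch of how an amino acid composition vector might be computed from a primary sequence. It illustrates only the commonly used baseline, not the attribute set proposed here; the function name `amino_acid_composition` and the choice to ignore non-standard residue codes are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard residues as one-letter codes (fixed order defines the vector layout).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard residue in `sequence`.

    Assumption: characters outside the 20 standard codes (e.g. 'X') are
    ignored rather than counted toward the total.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Each sequence, whatever its length, maps to a fixed 20-dimensional vector whose entries sum to 1, which is what makes this representation convenient as input to a learner.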
