Alternate representation of distance matrices for characterization of protein structure

The most suitable method for the automated classification of protein structures remains an open problem in computational biology. To classify a protein structure accurately, an effective representation must be chosen. Here we present two methods of representing protein structure. The first represents the distances between the Cα atoms of a protein as a two-dimensional matrix and models the resulting surface with Zernike polynomials. The second uses a wavelet-based approach: we convert the distances between a protein's Cα atoms into a one-dimensional signal, which is then decomposed using a discrete wavelet transformation. Using the Zernike coefficients and the approximation coefficients of the wavelet decomposition as feature vectors, we test the effectiveness of our representations with two different classifiers on a dataset of more than 600 proteins taken from the 27 most-populated SCOP folds. We find that the wavelet decomposition greatly outperforms the Zernike model. With the wavelet representation, we achieve an accuracy of approximately 56%, roughly 12% higher than results reported on a similar but less challenging dataset. In addition, we can couple our structure-based feature vectors with several sequence-based properties to increase accuracy by another 5-7%. Finally, we use a multi-stage classification strategy on the combined features to increase performance to 78%, an improvement in accuracy of more than 15-20% and 34% over the highest reported sequence-based and structure-based classification results, respectively.
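The wavelet-based representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the distance matrix is turned into a one-dimensional signal, which wavelet is used, or how many decomposition levels are taken. The sketch assumes a row-by-row flattening of the strict upper triangle, a Haar wavelet, and two levels; the function names and the toy coordinates are hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between C-alpha coordinates."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def upper_triangle_signal(matrix):
    """Flatten the strict upper triangle row by row.

    Assumption: the abstract does not specify the ordering used to
    build the one-dimensional signal; this is one plausible choice.
    """
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def haar_approximation(signal, levels):
    """Approximation coefficients after `levels` Haar DWT steps.

    Assumption: Haar is used only because it is the simplest discrete
    wavelet; the wavelet family in the paper is not given here.
    """
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break
        if len(approx) % 2:
            # Pad odd-length input by repeating the last sample.
            approx.append(approx[-1])
        approx = [
            (approx[k] + approx[k + 1]) / math.sqrt(2)
            for k in range(0, len(approx), 2)
        ]
    return approx


# Toy C-alpha trace (hypothetical coordinates, 3.8 units apart).
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (3.8, 3.8, 0.0),
    (0.0, 3.8, 0.0),
    (0.0, 3.8, 3.8),
]

signal = upper_triangle_signal(distance_matrix(coords))
features = haar_approximation(signal, levels=2)
print(len(signal), len(features))
```

Each decomposition level roughly halves the signal length, so the approximation coefficients give a compact, coarse summary of the distance profile. Proteins of different lengths yield signals of different lengths, so some normalization to a fixed-size feature vector would be needed before classification; how that is done is not stated in the abstract.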
