On the Use of Structure and Sequence-Based Features for Protein Classification and Retrieval

The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as possible sources of new treatments. In folding simulations, similar intermediate structures may indicate a common folding pathway. To derive any measure of similarity, however, one must have an effective model of the protein that allows for easy comparison. In this work, we present two normalized, stand-alone representations of proteins that enable fast and efficient object retrieval based on sequence or structure. To create our sequence-based representation, we take the frequency and scoring matrices returned by the PSI-BLAST alignment algorithm and create a normalized summary using a discrete wavelet transform. Our structural descriptor is constructed using an algorithm we developed previously. First, we transform each 3D structure into a 2D distance matrix by calculating the pairwise distances between the amino acids of a protein. We then normalize this matrix and apply a 2D wavelet decomposition to generate a set of approximation coefficients, which serve as our feature vector. We also concatenate the sequence and structural descriptors to create a hybrid representation. We evaluate the generality of our models by using them as database indices for nearest-neighbor and range-based retrieval experiments and as feature vectors for classification using support vector machines. We find that our methods perform well when compared with the current state-of-the-art techniques for each task. Our results show that the sequence-based representation is on par with, or outperforms, the structure-based representation. Moreover, we find that in the classification context, the hybrid strategy affords a significant improvement over sequence or structure alone.
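The structural descriptor outlined in the abstract (distance matrix, normalization, 2D wavelet decomposition, approximation coefficients) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the wavelet family, the normalization scheme, or the decomposition depth, so the Haar wavelet, the nearest-neighbor resampling to a fixed 64 x 64 grid, the three decomposition levels, and all function names below are hypothetical choices made only for this example.

```python
import numpy as np


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates (n x 3 array)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def resize_nearest(mat, size):
    """Resample a square matrix to size x size by nearest-neighbor lookup.

    Assumption: this stands in for the paper's unspecified normalization step,
    so that proteins of different lengths yield equally sized matrices.
    """
    idx = np.round(np.linspace(0, mat.shape[0] - 1, size)).astype(int)
    return mat[np.ix_(idx, idx)]


def haar_approx_2d(mat, levels):
    """Approximation band of an orthonormal 2D Haar transform, applied `levels` times.

    Each level replaces every 2x2 block by (sum of the block) / 2, halving both
    dimensions. The matrix side must be divisible by 2 ** levels.
    """
    a = np.asarray(mat, dtype=float)
    for _ in range(levels):
        a = (a[0::2, 0::2] + a[0::2, 1::2] + a[1::2, 0::2] + a[1::2, 1::2]) / 2.0
    return a


def structure_descriptor(coords, size=64, levels=3):
    """Fixed-length feature vector from 3D coordinates (hypothetical parameters)."""
    d = distance_matrix(np.asarray(coords, dtype=float))
    d = resize_nearest(d, size)
    return haar_approx_2d(d, levels).ravel()


if __name__ == "__main__":
    # Synthetic coordinates standing in for one representative atom per residue.
    rng = np.random.default_rng(0)
    coords = rng.normal(size=(120, 3))
    vec = structure_descriptor(coords)
    print(vec.shape)
```

Because the descriptor is built from inter-residue distances only, it does not change when the structure is rotated or translated, so two structures can be compared without first superimposing them. With the assumed settings every protein maps to a vector of 64 coefficients regardless of its length, which is what makes such vectors usable as database index keys or as inputs to a classifier.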
