A Fast Protein Structure Retrieval System Using Image-Based Distance Matrices and Multidimensional Index

Indexing protein structures has been shown to provide a scalable solution for structure-to-structure comparisons in large protein structure retrieval systems. To conduct similarity searches against a database of 46,075 polypeptide chains with real-time responses, two critical issues must be addressed: information extraction and suitable indexing. In this paper, we apply computer vision techniques to extract the predominant information encoded in each 2D distance matrix, which is generated from the 3D coordinates of a protein chain. Distance matrices can represent specific protein structural topologies, and similar proteins generate similar matrices. Once meaningful features are extracted from the distance images, an advanced indexing structure, the entropy balanced statistical (EBS) k-d tree, is used to index the multidimensional data. With a limited amount of training data from domain experts, namely the structural classification of a subset of the available protein chains, we apply various pattern recognition techniques to determine clusters of proteins in the multidimensional feature space. Our system returns ranked search results from the protein database in seconds, with a reasonably high degree of precision.
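To make the distance-matrix representation concrete, the following is a minimal sketch, not the system's actual implementation. It assumes each chain is given as a list of 3D points, one per residue (for example C-alpha positions, which is an assumption here), and that the distance image is formed by linearly scaling distances to 8-bit gray levels. The function names `distance_matrix` and `to_gray_image` and the scaling rule are hypothetical choices made for illustration.

```python
import math
from typing import List, Optional, Sequence

Point = Sequence[float]


def distance_matrix(coords: List[Point]) -> List[List[float]]:
    """Return the symmetric matrix of pairwise Euclidean distances.

    Entry [i][j] is the distance between residue i and residue j.
    Which atom represents a residue is an assumption left to the caller.
    """
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            matrix[i][j] = d
            matrix[j][i] = d  # the matrix is symmetric by construction
    return matrix


def to_gray_image(matrix: List[List[float]],
                  max_distance: Optional[float] = None) -> List[List[int]]:
    """Map distances to gray levels 0..255 (hypothetical linear scaling).

    If max_distance is not given, the largest distance in the matrix is used,
    so the most distant residue pair maps to 255.
    """
    peak = max_distance
    if peak is None:
        peak = max((d for row in matrix for d in row), default=0.0)
    if peak <= 0:
        return [[0 for _ in row] for row in matrix]
    return [[min(255, round(255 * d / peak)) for d in row] for row in matrix]


if __name__ == "__main__":
    # Three made-up points standing in for residue coordinates.
    chain = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    dm = distance_matrix(chain)
    image = to_gray_image(dm)
    for row in image:
        print(row)
```

Because the matrix depends only on inter-residue distances, it is unchanged by rotating or translating the chain, which is one reason similar structures yield similar images. Feature extraction and EBS k-d tree indexing would operate on such images and are not shown here.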
