Efficient protein tertiary structure retrievals and classifications using content based comparison algorithms

Functionally important sites of proteins are potentially conserved within specific three-dimensional structural folds. To understand the structure-to-function relationship, life science researchers need to retrieve similar structures from protein databases and classify them into the same protein fold. Traditional protein structure retrieval and classification methods are either computationally expensive or labor intensive. In the past decade, more than 35,000 protein structures have been identified. To meet the need for fast retrieval and classification of high-throughput protein data, our research covers three main subjects: (1) Real-time global protein structure retrieval: We introduce an image-based approach that extracts signatures of three-dimensional protein structures. These high-level protein signatures are then indexed by multi-dimensional indexing trees for fast retrieval. (2) Real-time global protein structure classification: We propose an advanced knowledge discovery and data mining (KDD) model that converts high-level protein signatures into itemsets for mining association rules. The advantage of this KDD approach is that it effectively reveals the hidden knowledge in similar protein tertiary structures and quickly suggests possible SCOP domains for a newly discovered protein. In addition, we develop a non-parametric classifier, E-Predict, that can rapidly assign known SCOP folds to newly discovered proteins and recognize novel folds among them. (3) Efficient local protein structure retrieval and classification: We propose a novel algorithm, the Index-based Protein Substructure Alignment (IPSA), that constructs a two-layer indexing tree to capture the obscured similarity of protein substructures in a timely fashion. Our methods achieve high efficiency with reasonably high accuracy and will benefit the study of high-throughput protein structure-function evolutionary relationships.
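The signature-then-search idea behind subject (1) can be illustrated with a minimal Python sketch. It is not the method described in the abstract, which does not specify its features or index structure. The sketch assumes that the "image" of a structure is its pairwise distance matrix, uses a normalized distance histogram as a stand-in signature, and replaces the multi-dimensional indexing tree with a plain linear scan. The function names `distance_matrix`, `signature`, and `retrieve`, the bin count, and the 40 Å cutoff are all hypothetical choices made for illustration.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points (assumed: one point per residue)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def signature(dm, bins=8, max_dist=40.0):
    """Fixed-length signature: normalized histogram of the upper-triangle distances.

    This is an illustrative stand-in, not the image features used in the research.
    Distances at or beyond max_dist fall into the last bin.
    """
    counts = [0] * bins
    n = len(dm)
    for i in range(n):
        for j in range(i + 1, n):
            k = min(int(dm[i][j] / max_dist * bins), bins - 1)
            counts[k] += 1
    total = sum(counts) or 1  # avoid division by zero for chains with < 2 points
    return [c / total for c in counts]

def retrieve(query_sig, database, k=1):
    """Return the names of the k database entries closest to the query signature.

    A linear scan is used here for brevity; an indexing tree would replace it
    when the database is large.
    """
    ranked = sorted(database, key=lambda item: math.dist(query_sig, item[1]))
    return [name for name, _ in ranked[:k]]

# Toy data (made up): an extended chain and a compact helix-like chain.
extended = [(3.8 * i, 0.0, 0.0) for i in range(10)]
compact = [(2.3 * math.cos(i), 2.3 * math.sin(i), 1.5 * i) for i in range(10)]
database = [
    ("extended", signature(distance_matrix(extended))),
    ("compact", signature(distance_matrix(compact))),
]

# A slightly perturbed extended chain serves as the query.
query = [(3.7 * i, 0.1, 0.0) for i in range(10)]
print(retrieve(signature(distance_matrix(query)), database))
```

Because every structure is reduced to a vector of the same length, comparing two proteins costs a single vector distance rather than a full structural alignment, which is what makes index-based retrieval over a large database plausible.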
