ProCKSI: a decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information

Background

We introduce the decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information (ProCKSI). ProCKSI integrates various protein similarity measures through an easy-to-use interface that allows multiple proteins to be compared simultaneously. It employs the Universal Similarity Metric (USM), the Maximum Contact Map Overlap (MaxCMO) of protein structures, and external methods such as DaliLite, TM-align, the Combinatorial Extension (CE) of the optimal path, and the FAST Align and Search Tool (FAST). Additionally, ProCKSI allows the user to upload a user-defined similarity matrix to supplement these methods, and computes a similarity consensus in order to provide a rich, integrated, multicriteria view of large datasets of protein structures.

Results

We present ProCKSI's architecture and workflow, describe its intuitive user interface, and show its potential on three distinct test cases. In the first case, ProCKSI is used to evaluate the results of a previous CASP competition, assessing the similarity of proposed models for given targets where the structures may deviate substantially from one another. To perform this type of comparison reliably, we introduce a new consensus method. The second study verifies a classification scheme for protein kinases that was originally derived by Hanks and Hunter from sequence comparison; here we instead use a consensus similarity measure based on structures. In the third experiment, using the Rost and Sander dataset (RS126), we investigate how combining different sets of similarity measures influences the quality and performance of ProCKSI's new consensus measure. ProCKSI performs well with all three datasets, showing its potential for complex, simultaneous multi-method assessment of structural similarity in large protein datasets. Furthermore, combining different similarity measures is usually more robust than relying on a single measure.

Conclusion

Based on a diverse set of similarity measures, ProCKSI computes a consensus similarity profile for the entire protein set. All results can be clustered, visualised, analysed and easily compared with each other through a simple and intuitive interface.

ProCKSI is publicly available at http://www.procksi.net for academic and non-commercial use.
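
To make the idea of a similarity consensus more concrete, the following is a minimal sketch of one way several pairwise similarity matrices over the same protein set could be combined. It is an illustration only and not ProCKSI's documented procedure: the function names `normalise` and `consensus`, the min-max rescaling step, and the weighted mean are all assumptions introduced here.

```python
def normalise(matrix):
    """Min-max rescale a square matrix so that its entries lie in [0, 1].

    Assumption: rescaling makes scores from different methods comparable.
    """
    flat = [value for row in matrix for value in row]
    lowest, highest = min(flat), max(flat)
    span = highest - lowest
    if span == 0:
        # A constant matrix carries no ranking information.
        return [[0.0 for _ in row] for row in matrix]
    return [[(value - lowest) / span for value in row] for row in matrix]


def consensus(matrices, weights=None):
    """Weighted mean of rescaled similarity matrices (hypothetical helper).

    `matrices` is a list of n x n lists of floats, one per similarity
    measure, all indexed by the same ordering of proteins.
    """
    if not matrices:
        raise ValueError("at least one similarity matrix is required")
    n = len(matrices[0])
    for matrix in matrices:
        if len(matrix) != n or any(len(row) != n for row in matrix):
            raise ValueError("all matrices must be square and of equal size")
    if weights is None:
        weights = [1.0] * len(matrices)
    if len(weights) != len(matrices):
        raise ValueError("one weight per matrix is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    scaled = [normalise(matrix) for matrix in matrices]
    return [
        [
            sum(w * s[i][j] for w, s in zip(weights, scaled)) / total
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Called with one matrix per method (for example, one from a contact-map overlap score and one uploaded by the user), `consensus` returns a single matrix that could then be passed to a clustering routine. Rescaling before averaging is a design choice made here so that no measure dominates merely because of its numeric range.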
