Exact Protein Structure Classification Using the Maximum Contact Map Overlap Metric

In this work we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that this measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric on that space makes it possible to avoid pairwise comparisons against the entire database, and thus to explore the protein space significantly faster than in a non-metric space. On two gold-standard classification benchmark sets of 6,759 and 67,609 proteins, our exact k-nearest neighbor (k-NN) scheme classifies up to 95% and 99% of queries correctly, respectively. Our k-NN classification thus provides a promising approach for the automatic classification of protein structures based on contact map overlap.
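To illustrate why metric properties let a search skip comparisons, here is a minimal sketch of triangle-inequality pruning for an exact nearest-neighbor query. It is not the scheme used in this work: the Jaccard distance on sets stands in for max-CMO (which requires solving an optimization problem per pair), the search uses a single pivot and k = 1, and all names (`jaccard_distance`, `nearest_neighbor_with_pivot`, the toy `database`) are hypothetical. For any metric d, |d(q, p) - d(p, x)| <= d(q, x), so a candidate x whose lower bound already reaches the best distance found so far cannot improve the result and need not be compared to the query.

```python
from typing import Callable, List, Sequence, Tuple


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Jaccard distance, a true metric on finite sets (stand-in for max-CMO)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_neighbor_with_pivot(
    query: frozenset,
    database: Sequence[frozenset],
    pivot_index: int,
    pivot_dists: Sequence[float],
    dist: Callable[[frozenset, frozenset], float],
) -> Tuple[int, float, int]:
    """Exact 1-NN search; returns (index, distance, number of distance evaluations).

    pivot_dists[i] must hold dist(database[pivot_index], database[i]),
    precomputed once for the whole database.
    """
    d_query_pivot = dist(query, database[pivot_index])
    evaluations = 1
    best_index, best_dist = pivot_index, d_query_pivot

    for i, item in enumerate(database):
        if i == pivot_index:
            continue
        # Triangle inequality gives a lower bound on dist(query, item).
        lower_bound = abs(d_query_pivot - pivot_dists[i])
        if lower_bound >= best_dist:
            continue  # item cannot be strictly closer than the current best
        d = dist(query, item)
        evaluations += 1
        if d < best_dist:
            best_index, best_dist = i, d

    return best_index, best_dist, evaluations


# Toy data (assumed for illustration only).
database: List[frozenset] = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({7, 8, 10}),
    frozenset({20, 21}),
]
pivot_index = 0
pivot_dists = [jaccard_distance(database[pivot_index], x) for x in database]

query = frozenset({1, 2, 3, 5})
index, distance, evaluations = nearest_neighbor_with_pivot(
    query, database, pivot_index, pivot_dists, jaccard_distance
)
print(index, distance, evaluations)
```

The pruning is only valid because the distance obeys the triangle inequality; with a non-metric similarity score the skipped candidates could contain the true nearest neighbor, and every database entry would have to be compared to the query.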
