A statistical model to correct systematic bias introduced by algorithmic thresholds in protein structural comparison algorithms

The identification of protein function is crucial to understanding cellular processes and to selecting novel proteins as drug targets. However, experimental methods for determining protein function can be expensive and time-consuming. Protein partial structure comparison methods seek to guide and accelerate function determination by matching representations of characterized functional sites (motifs) to substructures within uncharacterized proteins (matches). One difficulty common to all protein structural comparison techniques is the computational cost of obtaining a match. To remain practical, some algorithms employ geometric threshold-based searches that eliminate biologically irrelevant matches. Thresholds refine and accelerate the method by limiting the number of potential matches that need to be considered. However, because statistical models rely on the output of the geometric matching method to measure statistical significance accurately, geometric thresholds can also artificially distort the basis of those models, making statistical scores dependent on the thresholds and potentially causing significant reductions in the accuracy of the functional annotation method. This paper proposes a point-weight based correction approach that quantifies and models this dependence in order to account for the systematic bias introduced by heuristics. Using a benchmark dataset of 20 structural motifs, we show that the point-weight correction procedure accurately models the information lost during the geometric comparison phase, removing systematic bias and greatly reducing misclassification rates of functionally related proteins while maintaining specificity.
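The kind of bias described above can be illustrated with a small numerical sketch. It is not the paper's procedure, which the abstract does not specify; it only shows, under stated assumptions, how discarding matches above a geometric threshold inflates an empirical p-value, and how accounting for the discarded matches as a single lumped weight removes that inflation. The function names, the Gaussian background scores, and the threshold value are all hypothetical.

```python
import random

def empirical_p_value(scores, observed):
    """Fraction of background scores at least as good (as small) as the observed score."""
    return sum(s <= observed for s in scores) / len(scores)

def point_weight_p_value(kept_scores, n_total, observed):
    """Empirical p-value that keeps the discarded matches in the denominator.

    Assumption: every discarded match scored worse than the threshold, so the
    discarded fraction acts as one lumped weight above the threshold.
    """
    return sum(s <= observed for s in kept_scores) / n_total

random.seed(0)

# Hypothetical background: one best-match score (e.g. an RMSD-like distance) per protein.
background = [random.gauss(3.0, 1.0) for _ in range(10_000)]

# Hypothetical geometric threshold: matches scoring above it are never reported.
threshold = 2.5
kept = [s for s in background if s <= threshold]

observed = 1.5  # score of the match being assessed (below the threshold)

p_full = empirical_p_value(background, observed)       # no threshold applied
p_naive = empirical_p_value(kept, observed)            # threshold ignored by the model
p_corrected = point_weight_p_value(kept, len(background), observed)

print(f"full: {p_full:.4f}  naive: {p_naive:.4f}  corrected: {p_corrected:.4f}")
```

In this toy setting the naive estimate is too large because its denominator counts only the surviving matches, while the corrected estimate coincides with the one computed from the unthresholded background.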

[17]  D G Vassylyev,et al.  Crystal structures of phenylalanyl-tRNA synthetase complexed with phenylalanine and a phenylalanyl-adenylate analogue. , 1999, Journal of molecular biology.

[18]  Janet M. Thornton,et al.  An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis , 2003, Bioinform..

[19]  P. Willett,et al.  A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. , 1994, Journal of molecular biology.

[20]  Bradley E. Bernstein,et al.  Synergistic effects of substrate-induced conformational changes in phosphoglycerate kinase activation , 1997, Nature.

[21]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[22]  O. Lichtarge,et al.  A family of evolution-entropy hybrid methods for ranking protein residues by importance. , 2004, Journal of molecular biology.

[23]  J. Coleman,et al.  Structure and mechanism of alkaline phosphatase. , 1992, Annual review of biophysics and biomolecular structure.

[24]  Lydia E. Kavraki,et al.  LabelHash: A Flexible and Extensible Method for Matching Structural Motifs , 2008 .

[25]  D. R. Holland,et al.  Structural analysis of zinc substitutions in the active site of thermolysin , 1995, Protein science : a publication of the Protein Society.

[26]  Lydia E. Kavraki,et al.  Geometric Sieving: Automated Distributed Optimization of 3D Motifs for Protein Function Prediction , 2006, RECOMB.

[27]  S. Brenner A tour of structural genomics , 2001, Nature Reviews Genetics.

[28]  Lydia E. Kavraki,et al.  Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs , 2004, Pacific Symposium on Biocomputing.

[29]  Lydia E. Kavraki,et al.  The MASH Pipeline for Protein Function Prediction and an Algorithm for the Geometric Refinement of 3D Motifs , 2007, J. Comput. Biol..

[30]  D. Moras,et al.  Glycyl-tRNA synthetase uses a negatively charged pit for specific recognition and activation of glycine. , 1999, Journal of molecular biology.

[31]  E. Webb Enzyme nomenclature 1992. Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes. , 1992 .

[32]  Robert B Russell,et al.  A model for statistical significance of local similarities in structure. , 2003, Journal of molecular biology.

[33]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[34]  Olivier Lichtarge,et al.  Recurrent use of evolutionary importance for functional annotation of proteins based on local structural similarity , 2006, Protein science : a publication of the Protein Society.

[35]  Janet M. Thornton,et al.  The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data , 2004, Nucleic Acids Res..

[36]  Haim J. Wolfson,et al.  Geometric hashing: an overview , 1997 .