Investigation of sequence features of hinge-bending regions in proteins with domain movements using kernel logistic regression

Background Hinge-bending movements in proteins comprising two or more domains form a large class of functional movements. Hinge-bending regions demarcate protein domains and collectively control the domain movement. Consequently, the ability to recognise sequence features of hinge-bending regions, and to predict these regions from sequence alone, would benefit various areas of protein research. For example, an understanding of how the sequence features of these regions relate to dynamic properties in multi-domain proteins would aid the rational design of linkers in therapeutic fusion proteins.
Results The DynDom database of protein domain movements comprises sequences annotated to indicate whether each amino acid residue is located within a hinge-bending region or within an intradomain region. Using statistical methods and Kernel Logistic Regression (KLR) models, these data were used to determine sequence features that favour or disfavour hinge-bending regions. This is a difficult classification problem because the number of negative cases (intradomain residues) is much larger than the number of positive cases (hinge residues). The statistical methods and the KLR models both show that cysteine has the lowest propensity for hinge-bending regions and that proline has the highest, even though proline is the most rigid amino acid. As hinge-bending regions have previously been shown to occur frequently at the terminal regions of secondary structures, the propensity for proline in these regions is likely due to its tendency to break secondary structures. The KLR models also indicate that isoleucine may act as a domain-capping residue. We have found that a quadratic KLR model outperforms a linear KLR model and that performance continues to improve up to very long window lengths (eighty residues), indicating long-range correlations.
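
As an illustration of the kind of statistic referred to above, the following minimal sketch computes a per-residue hinge propensity from annotated sequences. The propensity definition (frequency of an amino acid among hinge residues divided by its frequency among all residues), the function name `hinge_propensity`, the `"1"`/`"0"` label strings, and the toy data are all assumptions made for illustration; they are not taken from the DynDom database or from the authors' analysis.

```python
from collections import Counter


def hinge_propensity(sequences, labels):
    """Return {amino acid: propensity} for hinge regions.

    Assumed definition (not necessarily the authors'):
    propensity(a) = P(a | hinge residue) / P(a | any residue).
    `labels` holds one string per sequence, "1" marking a hinge residue.
    """
    hinge_counts = Counter()
    total_counts = Counter()
    for seq, lab in zip(sequences, labels):
        for aa, flag in zip(seq, lab):
            total_counts[aa] += 1
            if flag == "1":
                hinge_counts[aa] += 1

    n_hinge = sum(hinge_counts.values())
    n_total = sum(total_counts.values())
    if n_hinge == 0:
        raise ValueError("no hinge residues in the annotated data")

    return {
        aa: (hinge_counts[aa] / n_hinge) / (total_counts[aa] / n_total)
        for aa in total_counts
    }


# Hypothetical toy data: two short sequences with hinge annotations.
toy_sequences = ["ACPPGA", "CCAPGA"]
toy_labels = ["001100", "000100"]
propensities = hinge_propensity(toy_sequences, toy_labels)
```

A value above 1 means the residue is over-represented in hinge regions relative to its background frequency, and a value below 1 means it is under-represented. In the toy data every hinge residue is a proline, so proline scores well above 1 and cysteine scores 0.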
Conclusion In contrast to the only other approach that focused solely on interdomain hinge-bending regions, the method provides a modest but statistically significant improvement over a random classifier. One explanation of the KLR results is that, in the prediction of hinge-bending regions, a long-range correlation is at play between a small number of amino acids that either favour or disfavour hinge-bending regions. The resulting sequence-based prediction tool, HingeSeek, is available through a webserver at hingeseek.cmp.uea.ac.uk.
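
To make the idea of a windowed quadratic model concrete, the sketch below one-hot encodes a window of residues around a position and evaluates linear and quadratic kernels between two such windows. The abstract does not specify the encoding or the exact kernel form, so the one-hot scheme, the inhomogeneous degree-2 polynomial kernel, and the names `encode_window`, `linear_kernel` and `quadratic_kernel` are assumptions for illustration only, not the HingeSeek implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_window(seq, centre, half_width):
    """One-hot encode residues centre-half_width .. centre+half_width.

    Positions beyond either end of the sequence, and unrecognised
    residue codes, are encoded as all zeros (an assumed convention).
    """
    features = []
    for pos in range(centre - half_width, centre + half_width + 1):
        one_hot = [0.0] * len(AMINO_ACIDS)
        if 0 <= pos < len(seq) and seq[pos] in AA_INDEX:
            one_hot[AA_INDEX[seq[pos]]] = 1.0
        features.extend(one_hot)
    return features


def linear_kernel(x, y):
    """Plain dot product: counts window positions with matching residues."""
    return sum(a * b for a, b in zip(x, y))


def quadratic_kernel(x, y, c=1.0):
    """Inhomogeneous polynomial kernel of degree 2: (x . y + c) ** 2."""
    return (linear_kernel(x, y) + c) ** 2


# Hypothetical example: windows of three residues from a toy sequence.
toy_seq = "ACDPG"
x = encode_window(toy_seq, centre=2, half_width=1)  # residues C, D, P
y = encode_window(toy_seq, centre=3, half_width=1)  # residues D, P, G
```

With one-hot features, expanding the quadratic kernel yields products of pairs of (position, residue) indicators, so a model built on it can weight the co-occurrence of two particular residues at two particular window positions. A linear kernel can only weight each position independently, which is one way to read the reported advantage of the quadratic model at long window lengths.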

[20]  Alexey G. Murzin,et al.  SCOP2 prototype: a new approach to protein structure mining , 2014, Nucleic Acids Res..

[21]  Silvio C. E. Tosatto,et al.  The Pfam protein families database in 2019 , 2018, Nucleic Acids Res..

[22]  Steven Hayward,et al.  Improvements in the analysis of domain motions in proteins from conformational change: DynDom version 1.50. , 2002, Journal of molecular graphics & modelling.

[23]  Ji Zhu,et al.  Kernel Logistic Regression and the Import Vector Machine , 2001, NIPS.

[24]  Steven Hayward,et al.  Methodological improvements for the analysis of domain movements in large biomolecular complexes , 2019, Biophysics and physicobiology.

[25]  A M Lesk,et al.  Mechanisms of domain closure in proteins. , 1984, Journal of molecular biology.

[26]  J. Sengupta The Nonparametric Approach , 1989 .

[27]  Gavin C. Cawley,et al.  Generalised Kernel Machines , 2007, 2007 International Joint Conference on Neural Networks.

[28]  G. Cawley,et al.  Efficient approximate leave-one-out cross-validation for kernel logistic regression , 2008, Machine Learning.

[29]  Richard A. Lee,et al.  A comprehensive and non-redundant database of protein domain movements , 2005, Bioinform..

[30]  C. Ponting,et al.  The natural history of protein domains. , 2002, Annual review of biophysics and biomolecular structure.

[31]  K Schulten,et al.  Protein domain movements: detection of rigid domains and visualization of hinges in comparisons of atomic coordinates , 1997, Proteins.

[32]  Igor B Kuznetsov,et al.  Ordered conformational change in the protein backbone: Prediction of conformationally variable positions from sequence and low‐resolution structural data , 2008, Proteins.

[33]  Michael McDuffie,et al.  FlexPred: a web-server for predicting residue positions involved in conformational switches in proteins , 2008, Bioinformation.

[34]  Wei-Chiang Shen,et al.  Fusion protein linkers: property, design and functionality. , 2013, Advanced drug delivery reviews.

[35]  Xu Sun,et al.  Fast Implementation of DeLong’s Algorithm for Comparing the Areas Under Correlated Receiver Operating Characteristic Curves , 2014, IEEE Signal Processing Letters.

[36]  S. Hayward,et al.  Structural principles governing domain motions in proteins , 1999, Proteins.

[37]  John A. Nelder,et al.  A Simplex Method for Function Minimization , 1965, Comput. J..

[38]  H. Berendsen,et al.  Systematic analysis of domain motions in proteins from conformational change: New results on citrate synthase and T4 lysozyme , 1998, Proteins.

[39]  K. Hinsen,et al.  Analysis of domain motions in large proteins , 1999, Proteins.

[40]  M. Gerstein,et al.  A database of macromolecular motions. , 1998, Nucleic acids research.

[41]  P Argos,et al.  An investigation of oligopeptides linking domains in protein tertiary structures and possible candidates for general gene fusion. , 1990, Journal of molecular biology.

[42]  Jaap Heringa,et al.  An analysis of protein domain linkers: their classification and role in protein folding. , 2002, Protein engineering.

[43]  H. Berendsen,et al.  Model‐free methods of analyzing domain motions in proteins from simulation: A comparison of normal mode analysis and molecular dynamics simulation of lysozyme , 1997, Proteins.