Computationally identifying hot spots in protein-DNA binding interfaces using an ensemble approach

Background: Protein-DNA interactions govern a large number of cellular processes, and they can be altered by a small fraction of interface residues, the so-called hot spots, which account for most of the interface binding free energy. Accurate prediction of hot spots is critical for understanding the principles of protein-DNA interactions. Several computational methods can already predict large numbers of hot-spot residues accurately and efficiently. However, the scarcity of experimentally validated hot-spot residues in protein-DNA complexes and the low diversity of the features employed limit the performance of existing methods.

Results: Here, we report a new computational method for predicting hot spots in protein-DNA binding interfaces. The method, called PreHots (short for Predicting Hotspots), adopts an ensemble stacking classifier that integrates different machine learning classifiers to generate a robust model with 19 features selected by a sequential backward feature selection algorithm. To train and evaluate the model, we constructed two new and reliable datasets (a benchmark dataset for model training and an independent dataset for validation), which together comprise 123 hot spots and 137 non-hot spots from 89 protein-DNA complexes. The data were manually collected from the literature and existing databases, followed by a strict redundancy-removal process. Our method achieves a sensitivity of 0.813 and an AUC of 0.868 in 10-fold cross-validation on the benchmark dataset, and a sensitivity of 0.818 and an AUC of 0.820 on the independent test dataset. These results show that our approach outperforms existing methods.

Conclusions: PreHots, which is based on a stacking ensemble of boosting algorithms, can reliably predict hot spots at protein-DNA binding interfaces on a large scale. Compared with existing methods, PreHots achieves better prediction performance.
Both the webserver of PreHots and the datasets are freely available at: http://dmb.tongji.edu.cn/tools/PreHots/.
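
The abstract names two techniques that are easy to misread without a concrete instance: sequential backward feature selection and evaluation by sensitivity and AUC. The following is a minimal sketch of both in plain Python, not the PreHots implementation. The feature names (`conservation`, `hbond`, `noise`), the toy values, and the scoring rule (summing the selected feature values) are all hypothetical stand-ins; in PreHots the subset score would come from the stacking classifier under cross-validation.

```python
def sensitivity(y_true, y_pred):
    """Fraction of true hot spots (label 1) that are predicted as hot spots."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)  # assumes at least one true hot spot


def auc(y_true, scores):
    """AUC as the probability that a random hot spot outscores a random
    non-hot spot, with ties counted as one half (assumes both classes present)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def backward_select(features, score_fn, n_keep):
    """Sequential backward selection: start from all features and repeatedly
    drop the one whose removal leaves the best-scoring subset."""
    selected = list(features)
    while len(selected) > n_keep:
        candidates = ([f for f in selected if f != drop] for drop in selected)
        selected = max(candidates, key=score_fn)
    return selected


# Hypothetical toy data: three hot spots (1) and three non-hot spots (0),
# each described by three made-up integer features.
LABELS = [1, 1, 1, 0, 0, 0]
ROWS = [
    {"conservation": 9, "hbond": 8, "noise": 1},
    {"conservation": 8, "hbond": 6, "noise": 9},
    {"conservation": 7, "hbond": 7, "noise": 2},
    {"conservation": 3, "hbond": 4, "noise": 8},
    {"conservation": 2, "hbond": 5, "noise": 3},
    {"conservation": 4, "hbond": 2, "noise": 7},
]


def subset_auc(subset):
    """Stand-in scorer (an assumption): rank residues by the sum of the
    selected features and report the AUC of that ranking."""
    scores = [sum(row[f] for f in subset) for row in ROWS]
    return auc(LABELS, scores)


kept = backward_select(["conservation", "hbond", "noise"], subset_auc, n_keep=2)

# Evaluate hypothetical predicted probabilities at a 0.5 decision threshold.
probs = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]
preds = [1 if p >= 0.5 else 0 for p in probs]
sens = sensitivity(LABELS, preds)
area = auc(LABELS, probs)
```

On this toy data the uninformative `noise` feature is the one removed, because dropping it leaves the subset with the highest AUC. Sensitivity depends on the chosen threshold while AUC does not, which is one reason to report both.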
