An Integrative Computational Framework Based on a Two-Step Random Forest Algorithm Improves Prediction of Zinc-Binding Sites in Proteins

Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank, where the zinc ions usually have catalytic, regulatory or structural roles critical for the function of the protein. Accurate prediction of zinc-binding sites is useful not only for inferring protein function but also for predicting 3D structure. Here, we present a new integrative framework that combines multiple sequence and structural properties with graph-theoretic network features, followed by efficient feature selection, to improve the prediction of zinc-binding sites. We investigate what information relevant to zinc-binding site prediction can be retrieved at the sequence, structure and network levels. We perform a two-step feature selection using random forest to remove redundant features and to quantify the relative importance of the retrieved features. Benchmarked on a high-quality structural dataset containing 1,103 protein chains and 484 zinc-binding residues, our method achieved >80% recall at a precision of 75% for the zinc-binding residues Cys, His, Glu and Asp in 5-fold cross-validation tests. At the residue level and on the same dataset, this recall is 10%-28% higher than that of SitePredict and zincfinder at the same 75% precision. In an independent test, our method achieved recalls of 0.790 and 0.759 at the residue and protein levels, respectively, outperforming the other two methods. Moreover, the AUC (the Area Under the Curve) and AURPC (the Area Under the Recall-Precision Curve) of our method are also higher than those of the other two methods. Our method can not only be applied to large-scale identification of zinc-binding sites when structural information for the target is available, but can also give valuable insights into the important features, arising from different levels, that collectively characterize zinc-binding sites.
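The abstract does not spell out the two-step feature selection, so the following Python sketch shows only one plausible reading of it, under stated assumptions: step 1 ranks all features by random forest importance and keeps the top `top_k`; step 2 adds the ranked features one at a time and keeps a feature only if it improves a cross-validated score. The function name `two_step_rf_selection`, its parameters, the choice of scikit-learn, the average-precision score and the synthetic data are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def two_step_rf_selection(X, y, top_k=10, n_trees=50, cv=5, seed=0):
    """Hypothetical two-step selection; returns (selected indices, importances)."""
    # Step 1: fit a forest on all features and keep the top_k by importance.
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    importances = forest.feature_importances_
    ranked = np.argsort(importances)[::-1][:top_k]

    # Step 2: walk down the ranking; keep a feature only if adding it
    # improves the cross-validated average precision (drops redundant ones).
    selected, best_score = [], -np.inf
    for idx in ranked:
        trial = selected + [int(idx)]
        model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        score = cross_val_score(model, X[:, trial], y, cv=cv,
                                scoring="average_precision").mean()
        if score > best_score:
            selected, best_score = trial, score
    return selected, importances


# Synthetic, imbalanced stand-in data (NOT the paper's dataset).
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=6, weights=[0.8, 0.2], random_state=0)
selected, importances = two_step_rf_selection(X, y, top_k=8)
print("selected feature indices:", selected)
```

The importances returned from step 1 give a rough measure of each feature's relative contribution, while step 2 is what discards features that add nothing once better-ranked ones are already included.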
The scripts and datasets are available at http://protein.cau.edu.cn/zincidentifier/.
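The headline comparison above is stated as recall at a fixed precision of 75%. As a minimal sketch of how such a figure can be computed from per-residue scores, the hypothetical helper `recall_at_precision` below scans every score cut-off and reports the highest recall whose precision meets the target. The function name and the toy labels and scores are assumptions made for illustration; they are not taken from the paper or its scripts.

```python
import numpy as np


def recall_at_precision(y_true, scores, min_precision=0.75):
    """Highest recall over all score cut-offs with precision >= min_precision."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if not y_true.any():
        raise ValueError("y_true contains no positive examples")

    # Sort residues from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    hits = y_true[order]
    sorted_scores = scores[order]

    # Cumulative true positives when the top-n residues are predicted positive.
    tp = np.cumsum(hits)
    n_predicted = np.arange(1, len(hits) + 1)
    precision = tp / n_predicted
    recall = tp / hits.sum()

    # Only cut-offs at the end of a run of tied scores are valid thresholds.
    is_cutoff = np.r_[sorted_scores[1:] != sorted_scores[:-1], True]
    ok = is_cutoff & (precision >= min_precision)
    return float(recall[ok].max()) if ok.any() else 0.0


# Toy example: 1 = zinc-binding residue, 0 = non-binding (made-up values).
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(recall_at_precision(labels, scores, min_precision=0.75))  # 0.75
```

Fixing the precision and comparing recall, as in this sketch, puts methods with differently scaled scores on a common operating point.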
