Disulfide Connectivity Prediction Based on Modelled Protein 3D Structural Information and Random Forest Regression

Disulfide connectivity is an important protein structural characteristic. Accurately predicting disulfide connectivity solely from protein sequence helps improve our understanding of protein structure and function, especially in the post-genome era, in which large volumes of sequenced proteins that lack functional annotation are accumulating quickly. In this study, a new feature extracted from predicted protein 3D structural information is proposed and integrated with traditional features to form a discriminative feature set. Based on the extracted features, a random forest regression model is trained to predict protein disulfide connectivity. We compare the proposed method with popular existing predictors through both cross-validation and independent validation tests on benchmark datasets. The experimental results show that the proposed method outperforms the existing predictors. We believe this advantage stems from both the discriminative capability of the newly developed features and the modelling capability of the random forest. The web server implementation, called TargetDisulfide, and the benchmark datasets are freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetDisulfide.
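The abstract does not say how per-pair regression outputs are turned into a full connectivity pattern. As a minimal sketch of one common approach, assuming each cysteine pair has already received a bonding score from some regression model (for instance a random forest), the pattern can be chosen as the perfect pairing with the highest total score. The function names, cysteine positions, and scores below are hypothetical and are not taken from TargetDisulfide.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def perfect_matchings(items: List[int]) -> List[List[Pair]]:
    """Enumerate every way to split an even-sized list into pairs."""
    if not items:
        return [[]]
    first, rest = items[0], items[1:]
    result: List[List[Pair]] = []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in perfect_matchings(remaining):
            result.append([(first, partner)] + tail)
    return result


def best_connectivity(cysteines: List[int],
                      pair_scores: Dict[Pair, float]) -> List[Pair]:
    """Return the pairing whose summed pair scores are largest.

    pair_scores maps (i, j) with i < j to a predicted bonding score.
    The scores are assumed to come from a regression model; this
    function only performs the exhaustive search over patterns.
    """
    if len(cysteines) % 2 != 0:
        raise ValueError("an even number of bonded cysteines is required")
    ordered = sorted(cysteines)
    return max(
        perfect_matchings(ordered),
        key=lambda pattern: sum(pair_scores[p] for p in pattern),
    )


# Hypothetical sequence positions of four bonded cysteines and
# hypothetical pair scores (illustrative values only).
cysteines = [5, 12, 30, 41]
pair_scores = {
    (5, 12): 0.20, (5, 30): 0.90, (5, 41): 0.10,
    (12, 30): 0.15, (12, 41): 0.80, (30, 41): 0.30,
}

pattern = best_connectivity(cysteines, pair_scores)
print(pattern)  # [(5, 30), (12, 41)]
```

Exhaustive enumeration is practical only for small numbers of bonds, since the number of patterns grows as (2n-1)!! for n bonds; a maximum-weight matching algorithm would be the usual replacement for larger cases.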
