Predicting Protein–Protein Interaction Sites Using Sequence Descriptors and Site Propensity of Neighboring Amino Acids

Information about the interface sites of Protein–Protein Interactions (PPIs) is useful for many biological research works. However, despite the advancement of experimental techniques, the identification of PPI sites still remains as a challenging task. Using a statistical learning technique, we proposed a computational tool for predicting PPI interaction sites. As an alternative to similar approaches requiring structural information, the proposed method takes all of the input from protein sequences. In addition to typical sequence features, our method takes into consideration that interaction sites are not randomly distributed over the protein sequence. We characterized this positional preference using protein complexes with known structures, proposed a numerical index to estimate the propensity and then incorporated the index into a learning system. The resulting predictor, without using structural information, yields an area under the ROC curve (AUC) of 0.675, recall of 0.597, precision of 0.311 and accuracy of 0.583 on a ten-fold cross-validation experiment. This performance is comparable to the previous approach in which structural information was used. Upon introducing the B-factor data to our predictor, we demonstrated that the AUC can be further improved to 0.750. The tool is accessible at http://bsaltools.ym.edu.tw/predppis.

[1]  Xiaolong Wang,et al.  iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach , 2016, Journal of biomolecular structure & dynamics.

[2]  Kuo-Chen Chou,et al.  iPPBS-Opt: A Sequence-Based Ensemble Classifier for Identifying Protein-Protein Binding Sites by Optimizing Imbalanced Training Datasets , 2016, Molecules.

[3]  Kaustubh D. Dhole,et al.  Sequence-based prediction of protein-protein interaction sites with L1-logreg classifier. , 2014, Journal of theoretical biology.

[4]  K. Chou Impacts of bioinformatics to medicinal chemistry. , 2015, Medicinal chemistry (Shariqah (United Arab Emirates)).

[5]  T. Petersen,et al.  A generic method for assignment of reliability scores applied to solvent accessibility predictions , 2009, BMC Structural Biology.

[6]  Sam Ansari,et al.  Statistical analysis of predominantly transient protein–protein interfaces , 2005, Proteins.

[7]  Monica Chagoyen,et al.  Characterization of clinical signs in the human interactome , 2016, Bioinform..

[8]  Zhu-Hong You,et al.  Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis , 2013, BMC Bioinformatics.

[9]  Olivier Lichtarge,et al.  BIOINFORMATICS ORIGINAL PAPER Systems biology , 2004 .

[10]  P. Bourne,et al.  Exploiting sequence and structure homologs to identify protein–protein binding sites , 2005, Proteins.

[11]  Burkhard Rost,et al.  Protein–Protein Interactions More Conserved within Species than across Species , 2006, PLoS Comput. Biol..

[12]  Hui Ding,et al.  Using deformation energy to analyze nucleosome positioning in genomes. , 2016, Genomics.

[13]  Thomas L. Madden,et al.  Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. , 2001, Nucleic acids research.

[14]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[15]  M. Kanehisa,et al.  Cluster analysis of amino acid indices for prediction of protein structure and function. , 1988, Protein engineering.

[16]  Michael Q. Zhang,et al.  Computational prediction of methylation status in human genomic sequences. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[17]  Hamid Pezeshk,et al.  Impact of residue accessible surface area on the prediction of protein secondary structures , 2008, BMC Bioinformatics.

[18]  Rodrigo Lopez,et al.  Clustal W and Clustal X version 2.0 , 2007, Bioinform..

[19]  Xue-wen Chen,et al.  Sequence-based prediction of protein interaction sites with an integrative method , 2009, Bioinform..

[20]  Wei Chen,et al.  iOri-Human: identify human origin of replication by incorporating dinucleotide physicochemical properties into pseudo nucleotide composition , 2016, Oncotarget.

[21]  K. Chou,et al.  iCar-PseCp: identify carbonylation sites in proteins by Monte Carlo sampling and incorporating sequence coupled effects into general PseAAC , 2016, Oncotarget.

[22]  Ruth Nussinov,et al.  An overview of recent advances in structural bioinformatics of protein-protein interactions and a guide to their principles. , 2014, Progress in biophysics and molecular biology.

[23]  Kuo-Chen Chou,et al.  iSuc-PseOpt: Identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. , 2016, Analytical biochemistry.

[24]  Ruben Abagyan,et al.  Statistical analysis and prediction of protein–protein interfaces , 2005, Proteins.

[25]  Haiyuan Yu,et al.  Three-dimensional reconstruction of protein networks provides insight into human genetic disease , 2012, Nature Biotechnology.

[26]  A. Thomas,et al.  A fast method to predict protein interaction sites from sequences. , 2000, Journal of molecular biology.

[27]  V. S. Rao,et al.  Protein-Protein Interaction Detection: Methods and Analysis , 2014, International journal of proteomics.

[28]  Kuo-Chen Chou,et al.  iPPI-Esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. , 2015, Journal of theoretical biology.

[29]  S. Karlin,et al.  Evolutionary conservation of RecA genes in relation to protein structure and function , 1996, Journal of bacteriology.

[30]  Bo Yao,et al.  PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine , 2014, Amino Acids.

[31]  M. Kanehisa,et al.  Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. , 1996, Protein engineering.

[32]  Kristian Vlahovicek,et al.  Prediction of Protein–Protein Interaction Sites in Sequences and 3D Structures by Random Forests , 2009, PLoS Comput. Biol..

[33]  Huan-Xiang Zhou,et al.  Interaction-site prediction for protein complexes: a critical assessment , 2007, Bioinform..

[34]  Minoru Kanehisa,et al.  AAindex: amino acid index database, progress report 2008 , 2007, Nucleic Acids Res..

[35]  K. Chou,et al.  Support vector machines for prediction of protein subcellular location by incorporating quasi‐sequence‐order effect , 2002, Journal of cellular biochemistry.

[36]  Lilia M. Iakoucheva,et al.  Intrinsic Disorder Is a Common Feature of Hub Proteins from Four Eukaryotic Interactomes , 2006, PLoS Comput. Biol..

[37]  Michael Gribskov,et al.  Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching , 1996, Comput. Chem..

[38]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[39]  K. Chou Some remarks on protein attribute prediction and pseudo amino acid composition , 2010, Journal of Theoretical Biology.

[40]  Vasant Honavar,et al.  A two-stage classifier for identification of protein-protein interface residues , 2004, ISMB/ECCB.

[41]  Zhiyong Wang,et al.  Protein 8-class secondary structure prediction using Conditional Neural Fields , 2010, 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[42]  Ren Long,et al.  iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition , 2016, Bioinform..

[43]  Mona Singh,et al.  Predicting functionally important residues from sequence conservation , 2007, Bioinform..

[44]  A. Valencia,et al.  Automatic methods for predicting functionally important residues. , 2003, Journal of molecular biology.

[45]  Joël Janin,et al.  The third CAPRI assessment meeting Toronto, Canada, April 20-21, 2007. , 2007, Structure.

[46]  Kuo-Bin Li,et al.  Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's pseudo amino acid composition. , 2013, Journal of theoretical biology.

[47]  Kuo-Chen Chou,et al.  iPhos-PseEn: Identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier , 2016, Oncotarget.

[48]  Zsuzsanna Dosztányi,et al.  ANCHOR: web server for predicting protein binding regions in disordered proteins , 2009, Bioinform..

[49]  K. Chou,et al.  Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. , 2015, Molecular bioSystems.

[50]  Vaishali Ganganwar,et al.  An overview of classification algorithms for imbalanced datasets , 2012 .

[51]  K. Chou,et al.  pRNAm-PC: Predicting N(6)-methyladenosine sites in RNA sequences via physical-chemical properties. , 2016, Analytical biochemistry.

[52]  J. Thornton,et al.  Structural characterisation and functional significance of transient protein-protein interactions. , 2003, Journal of molecular biology.

[53]  Kuo-Chen Chou,et al.  iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC , 2016, Oncotarget.

[54]  Tirso Pons,et al.  Towards a detailed atlas of protein-protein interactions. , 2013, Current opinion in structural biology.

[55]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[56]  C. Sander,et al.  Database of homology‐derived protein structures and the structural meaning of sequence alignment , 1991, Proteins.

[57]  Burkhard Rost,et al.  ISIS: interaction sites identified from sequence , 2007, Bioinform..

[58]  Philip M. Kim,et al.  Relating Three-Dimensional Structures to Protein Networks Provides Evolutionary Insights , 2006, Science.

[59]  Kuo-Chen Chou,et al.  Identification of protein-protein binding sites by incorporating the physicochemical properties and stationary wavelet transforms into pseudo amino acid composition , 2016, Journal of biomolecular structure & dynamics.

[60]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[61]  Alfonso Valencia,et al.  Progress and challenges in predicting protein-protein interaction sites , 2008, Briefings Bioinform..

[62]  Kuo-Chen Chou,et al.  iROS-gPseKNC: Predicting replication origin sites in DNA by incorporating dinucleotide position-specific propensity into general pseudo nucleotide composition , 2016, Oncotarget.

[63]  Lucy J. Colwell,et al.  Inferring interaction partners from protein sequences , 2016, Proceedings of the National Academy of Sciences.

[64]  K. Chou,et al.  REVIEW : Recent advances in developing web-servers for predicting protein attributes , 2009 .

[65]  Burkhard Rost,et al.  More challenges for machine-learning protein interactions , 2015, Bioinform..

[66]  K. Chou,et al.  Predicting protein quaternary structure by pseudo amino acid composition , 2003, Proteins.

[67]  Burkhard Rost,et al.  Evolutionary profiles improve protein-protein interaction prediction from sequence , 2015, Bioinform..

[68]  Hyunsoo Kim,et al.  Protein secondary structure prediction based on an improved support vector machines approach. , 2003, Protein engineering.

[69]  Zhiqiang Ma,et al.  PSNO: Predicting Cysteine S-Nitrosylation Sites by Incorporating Various Sequence-Derived Features into the General Form of Chou’s PseAAC , 2014, International journal of molecular sciences.

[70]  Kenji Mizuguchi,et al.  Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites , 2010, Bioinform..

[71]  K. Chou,et al.  iACP: a sequence-based tool for identifying anticancer peptides , 2016, Oncotarget.

[72]  Peng Chen,et al.  Predicting protein interaction sites from residue spatial sequence profile and evolution rate , 2006, FEBS Letters.

[73]  Kengo Kinoshita,et al.  Identification of transient hub proteins and the possible structural basis for their multiple interactions , 2008, Protein science : a publication of the Protein Society.

[74]  W. Atchley,et al.  Solving the protein sequence metric problem. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[75]  David L. Robertson,et al.  An integrated view of molecular coevolution in protein-protein interactions. , 2010, Molecular biology and evolution.

[76]  Ren Long,et al.  iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework , 2016, Bioinform..

[77]  Jonathan G. Lees,et al.  Transient protein-protein interactions: structural, functional, and network properties. , 2010, Structure.

[78]  Debasis Dash,et al.  Role of intrinsic disorder in transient interactions of hub proteins , 2006, Proteins.

[79]  Hiroyuki Ogata,et al.  AAindex: Amino Acid Index Database , 1999, Nucleic Acids Res..

[80]  Eric D. Ross,et al.  Intrinsic Protein Disorder, Amino Acid Composition, and Histone Terminal Domains* , 2006, Journal of Biological Chemistry.

[81]  Itay Mayrose,et al.  Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues , 2002, ISMB.

[82]  Tal Pupko,et al.  ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids , 2010, Nucleic Acids Res..

[83]  Wenchao Jiang,et al.  Identifying protein–protein interaction sites in transient complexes with temperature factor, sequence profile and accessible surface area , 2009, Amino Acids.

[84]  Kuo-Chen Chou,et al.  pSuc-Lys: Predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. , 2016, Journal of theoretical biology.

[85]  Vladimir Vapnik,et al.  The Nature of Statistical Learning , 1995 .

[86]  A. Valencia,et al.  Correlated mutations contain information about protein-protein interaction. , 1997, Journal of molecular biology.

[87]  J. Forman-Kay,et al.  From sequence and forces to structure, function, and evolution of intrinsically disordered proteins. , 2013, Structure.

[88]  Jean-Christophe Nebel,et al.  Progress and challenges in predicting protein interfaces , 2015, Briefings Bioinform..

[89]  S.-W. Zhang,et al.  Prediction of protein homo-oligomer types by pseudo amino acid composition: Approached with an improved feature extraction and Naive Bayes Feature Fusion , 2006, Amino Acids.

[90]  Silvio C. E. Tosatto,et al.  MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins , 2014, Nucleic Acids Res..

[91]  O. Keskin,et al.  Transient protein-protein interactions. , 2011, Protein engineering, design & selection : PEDS.

[92]  T. Takagi,et al.  Prediction of protein-protein interaction sites using support vector machines. , 2004, Protein engineering, design & selection : PEDS.

[93]  Itay Mayrose,et al.  ConSurf: Using Evolutionary Data to Raise Testable Hypotheses about Protein Function , 2013 .

[94]  Fredrik Johansson,et al.  A comparative study of conservation and variation scores , 2010, BMC Bioinformatics.

[95]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.