Identification of ligand-binding residues using protein sequence profile alignment and query-specific support vector machine model.

Abstract Information embedded in the ligand-binding residues (LBRs) of proteins is important for understanding protein function. Accurately identifying potential LBRs remains a challenging problem, especially when only the protein sequence is given. In this paper, we establish a new query-specific computational method, named I-LBR, for identifying LBRs without directly using protein 3D structure information. I-LBR has two modes, I-LBRGP and I-LBRLS, for general-purpose and ligand-specific LBR identification, respectively. In both modes, I-LBR first constructs a query-specific training subset based on the query sequence information, then uses the support vector machine (SVM) algorithm to learn an LBR identification model, and finally predicts the probability that each residue in the query protein belongs to the LBR class. Experimental results on four testing datasets demonstrate that I-LBRLS is a better choice than I-LBRGP when the ligand type or types that the query protein binds are known. Compared with other state-of-the-art LBR identification methods, I-LBR achieves better or comparable performance. The web server of I-LBR and the datasets used in this study are freely available for academic use at https://jun-csbio.github.io/I-LBR .
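The three-step, query-specific workflow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions made only for illustration: cosine similarity between mean feature vectors stands in for sequence profile alignment, a small Pegasos-style linear SVM replaces whatever SVM configuration I-LBR actually uses, and a fixed logistic function replaces a fitted probability calibration. All function names, feature values, and protein names below are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(rows):
    """Column-wise mean of per-residue feature rows (a crude protein-level summary)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def select_training_subset(query_rows, library, k):
    """Step 1: keep the k library proteins most similar to the query.

    Assumption: cosine similarity of mean feature vectors is a stand-in for
    the sequence profile alignment score named in the paper's title.
    """
    q = mean_vector(query_rows)
    ranked = sorted(
        library,
        key=lambda p: cosine(q, mean_vector(p["features"])),
        reverse=True,
    )
    return ranked[:k]


def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Step 2: Pegasos-style sub-gradient training of a linear SVM (labels in {-1, +1})."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)  # last entry is the bias weight
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            xi = X[i] + [1.0]
            margin = y[i] * sum(a * b for a, b in zip(w, xi))
            w = [(1.0 - eta * lam) * a for a in w]  # shrink step (regularisation)
            if margin < 1.0:  # hinge loss is active for this residue
                w = [a + eta * y[i] * b for a, b in zip(w, xi)]
    return w


def predict_probabilities(w, rows):
    """Step 3: map SVM decision values into (0, 1) with a fixed logistic function."""
    probs = []
    for row in rows:
        score = sum(a * b for a, b in zip(w, row + [1.0]))
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return probs


def identify_lbrs(query_rows, library, k=2):
    """Run the three steps for one query: select subset, train, predict."""
    subset = select_training_subset(query_rows, library, k)
    X = [row for p in subset for row in p["features"]]
    y = [label for p in subset for label in p["labels"]]
    return predict_probabilities(train_linear_svm(X, y), query_rows)


# Toy library: label +1 marks a binding residue, -1 a non-binding residue.
library = [
    {"name": "near_a",
     "features": [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.8, 0.2, 1.0], [0.2, 0.8, 1.0]],
     "labels": [1, -1, 1, -1]},
    {"name": "near_b",
     "features": [[0.9, 0.1, 0.9], [0.1, 0.9, 0.9]],
     "labels": [1, -1]},
    {"name": "far",  # dissimilar protein whose binding pattern is reversed
     "features": [[0.0, 1.0, -1.0], [1.0, 0.0, -1.0]],
     "labels": [1, -1]},
]
query = [[0.9, 0.1, 1.0], [0.1, 0.9, 1.0]]

probs = identify_lbrs(query, library, k=2)
print([round(p, 3) for p in probs])
```

In this toy setup the dissimilar protein is excluded before training, so its conflicting labels never reach the model. That exclusion is the intuition behind building a separate training subset for each query. A realistic setup would more plausibly rely on an established SVM library with fitted probability outputs rather than the hand-written trainer shown here.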
