Protein-ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data

MOTIVATION Knowledge of protein-ligand binding residues is important for understanding the functions of proteins and their interaction mechanisms. Accurately identifying the potential binding sites of a specific ligand on a protein from its experimentally solved structure is still a challenging problem. Compared with structure-alignment-based methods, machine learning algorithms provide a flexible alternative that is less dependent on annotated homogeneous protein structures. Several factors are important for an efficient protein-ligand binding prediction model, e.g. a discriminative feature representation and an effective learning architecture that can deal with data that are both large-scale and severely imbalanced.

RESULTS In this study, we propose a novel deep-learning-based method called DELIA for protein-ligand binding residue prediction. In DELIA, a hybrid deep neural network is designed to integrate 1D sequence-based features with 2D structure-based amino acid distance matrices. To overcome the severe data imbalance between binding and non-binding residues, three strategies are used to enhance the model: oversampling in mini-batch, random under-sampling, and a stacking ensemble. Experimental results on five benchmark datasets demonstrate the effectiveness of the proposed DELIA pipeline.

AVAILABILITY The web server of DELIA is available at www.csbio.sjtu.edu.cn/bioinf/delia/.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
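
Two of the ingredients named in the abstract, a 2D amino acid distance matrix and oversampling within a mini-batch, can be illustrated with a small sketch. This is not the DELIA implementation: the function names, the use of one 3D coordinate per residue, the batch size and the 25% share of binding residues are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between residues; coords has shape (L, 3).

    Assumption: one representative atom (e.g. C-alpha) per residue.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def oversampled_batch(labels, batch_size, pos_fraction, rng):
    """Return residue indices for one mini-batch with a fixed share of binding residues.

    Hypothetical helper: pos_fraction is an illustrative parameter, not a value
    taken from the paper.
    """
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = int(round(batch_size * pos_fraction))
    n_neg = batch_size - n_pos
    # Binding residues are rare, so draw them with replacement (oversampling).
    pos_idx = rng.choice(pos, size=n_pos, replace=True)
    # Non-binding residues are drawn without replacement when enough exist.
    neg_idx = rng.choice(neg, size=n_neg, replace=len(neg) < n_neg)
    batch = np.concatenate([pos_idx, neg_idx])
    rng.shuffle(batch)
    return batch

# Toy protein: 50 residues with random coordinates and 3 binding residues.
rng = np.random.default_rng(0)
coords = rng.normal(size=(50, 3))
labels = np.zeros(50, dtype=int)
labels[[3, 17, 42]] = 1

dist = distance_matrix(coords)
batch = oversampled_batch(labels, batch_size=16, pos_fraction=0.25, rng=rng)
```

In this toy setting only 3 of 50 residues are binding, yet each mini-batch of 16 contains 4 binding examples, so the classifier sees the minority class far more often than its natural frequency would allow.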
