Identification of DNA-protein binding sites by bootstrap multiple convolutional neural networks on sequence information

Abstract: Identifying DNA–protein binding sites in protein sequences is essential for understanding a wide variety of biological processes, and huge volumes of protein sequences have accumulated in the post-genomic era. In this study, we propose a new prediction approach suited to imbalanced DNA–protein binding site data. Specifically, motivated by the imbalanced distribution of DNA–protein binding and non-binding sites, we employ the Adaptive Synthetic Sampling (ADASYN) approach to over-sample the positive data and a bootstrap strategy to under-sample the negative data, balancing the numbers of binding and non-binding samples. Furthermore, we employ three types of features to encode the sequence-based information of each protein residue: the position-specific scoring matrix, one-hot encoding and predicted solvent accessibility. In addition, we design an ensemble convolutional neural network classifier to handle the imbalance between binding and non-binding sites in protein sequences. Extensive experiments were conducted on three real DNA–protein binding site datasets, PDNA-543, PDNA-224 and PDNA-316, to validate the effectiveness of our method in predicting binding sites using ten-fold cross-validation. The experimental results demonstrate that our method achieves high prediction performance and outperforms state-of-the-art sequence-based DNA–protein binding site predictors in terms of Sensitivity, Specificity, Accuracy, Precision and Matthews Correlation Coefficient (MCC). Our method obtains MCC values of 0.63, 0.48 and 0.67 on the PDNA-543, PDNA-224 and PDNA-316 datasets, respectively. Compared with state-of-the-art prediction models, the MCC values of our method are higher by at least 0.24, 0.13 and 0.23 on the PDNA-543, PDNA-224 and PDNA-316 datasets, respectively.
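
The bootstrap under-sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `bootstrap_balanced_subsets` and `majority_vote`, the number of subsets, and the use of majority voting to combine the ensemble are all assumptions made for illustration, and the ADASYN over-sampling of the positive class is not shown. Each subset keeps every binding (positive) residue and pairs it with an equally sized sample of non-binding (negative) residues drawn with replacement, so that one classifier could be trained per balanced subset.

```python
import random


def bootstrap_balanced_subsets(positives, negatives, n_subsets=5, seed=0):
    """Build n_subsets balanced training sets (hypothetical helper).

    Each set contains all positives plus a bootstrap sample of negatives
    (drawn with replacement) of the same size as the positive set.
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        # Bootstrap: sample negatives with replacement, as many as positives.
        sampled_neg = rng.choices(negatives, k=len(positives))
        samples = list(positives) + sampled_neg
        labels = [1] * len(positives) + [0] * len(sampled_neg)
        subsets.append((samples, labels))
    return subsets


def majority_vote(predictions):
    """Combine 0/1 predictions from several models for one residue.

    Majority voting is an assumed combination rule, not taken from the paper.
    Ties resolve to 0 (non-binding).
    """
    return int(sum(predictions) * 2 > len(predictions))
```

For example, calling `bootstrap_balanced_subsets(pos, neg, n_subsets=3)` with 10 positive and 140 negative residues yields three sets of 20 samples each, half of them positive; a separate model would then be fitted to each set and the per-residue outputs combined with `majority_vote`.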
