Prediction of DNA-binding residues from sequence information using convolutional neural network

Most DNA-binding residue prediction methods overlook motif features, which are important for recognition between proteins and DNA. To use motif features efficiently for prediction, we first propose using a Convolutional Neural Network (CNN) to extract discriminative motif features. We then propose a neural network classifier, referred to as CNNsite, that combines the extracted motif features with sequence features and evolutionary features. Evaluation on PDNA-62, PDNA-224 and TR-265 shows that motif features perform better than sequence features and evolutionary features. Evaluation on PDNA-62, PDNA-224 and an independent data set shows that CNNsite also outperforms previous methods. We also show that many motif features composed of residues that play important roles in DNA-protein interactions have large discriminant power. This indicates that CNNsite is able to extract important motif features for DNA-binding residue prediction.
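
The idea of convolutional motif extraction followed by feature combination can be illustrated with a minimal sketch. It is not the CNNsite implementation: the window half-width, filter width, ReLU activation, global max pooling, random filter initialisation, and all function names are assumptions made for illustration, and the sequence and evolutionary feature vectors are passed in as placeholders.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_window(seq, center, half_width):
    """Return the residues around seq[center], padding the termini with 'X'."""
    return "".join(
        seq[i] if 0 <= i < len(seq) else "X"
        for i in range(center - half_width, center + half_width + 1)
    )


def one_hot(window):
    """Encode each residue as a 20-dim one-hot row; unknown residues ('X') are all zeros."""
    return [[1.0 if aa == res else 0.0 for aa in AMINO_ACIDS] for res in window]


def motif_features(window, filters):
    """Slide each filter along the window (1D convolution), apply ReLU,
    and keep the maximum response per filter (global max pooling)."""
    x = one_hot(window)
    feats = []
    for f in filters:  # f is a list of k rows, each holding 20 weights
        k = len(f)
        responses = []
        for start in range(len(x) - k + 1):
            score = sum(f[j][a] * x[start + j][a] for j in range(k) for a in range(20))
            responses.append(max(0.0, score))
        feats.append(max(responses))
    return feats


def random_filters(n_filters, k, seed=0):
    """Hypothetical untrained filters; in practice these would be learned."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(20)] for _ in range(k)]
            for _ in range(n_filters)]


def combined_features(seq, center, filters, seq_feats, evo_feats, half_width=5):
    """Concatenate motif features with placeholder sequence and evolutionary features."""
    window = residue_window(seq, center, half_width)
    return motif_features(window, filters) + list(seq_feats) + list(evo_feats)
```

In this sketch each filter acts as a motif detector: its pooled response is large when some position in the window matches the residue pattern the filter encodes. The concatenated vector would then be fed to a classifier that labels the central residue as binding or non-binding.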
