An improved sequence-based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis

Background

DNA-binding proteins (DNA-BPs) play a pivotal role in both eukaryotic and prokaryotic proteomes. Several computational methods have been proposed in the literature to identify DNA-BPs, and many informative features and properties have been used and shown to have a significant impact on this problem. However, the ultimate goal of bioinformatics is to predict DNA-BPs directly from primary sequence.

Results

This work focuses on how to transform these informative features into a uniform numeric representation and thereby improve the prediction accuracy of our SVM-based classifier for DNA-BPs. A systematic representation of selected features known to perform well is investigated. First, four kinds of protein properties are obtained and used to describe the protein sequence. Second, three feature transformation methods (OCTD, AC and SAA) are adopted to obtain numeric feature vectors at three main levels of the protein sequence (global, nonlocal and local), and their performance is exhaustively investigated. Finally, the mRMR-IFS feature selection method and an ensemble learning approach are used to determine the best prediction model. In addition, the optimal features selected by mRMR-IFS are examined in light of the observed results, which may provide useful insights into the mechanisms of protein-DNA interactions. In five-fold cross-validation on the DNAdset and DNAaset datasets, we obtained overall accuracies of 0.940 and 0.811 and MCCs of 0.881 and 0.614, respectively.

Conclusions

These results suggest that an entirely sequence-based protocol, which transforms and integrates informative features from different scales for use by an SVM, can be developed efficiently and can predict DNA-BPs accurately. Moreover, a novel systematic framework for sequence descriptor-based protein function prediction is proposed.
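As an illustration of the auto covariance (AC) transformation named in the Results, the following is a minimal sketch of how a variable-length sequence can be turned into a fixed-length vector from one per-residue property. It is not the authors' implementation. The property scale (Kyte-Doolittle hydropathy), the function name `auto_covariance`, the `max_lag` parameter and the absence of any property normalization are all assumptions made for this example; the abstract does not specify them.

```python
from statistics import mean

# Kyte-Doolittle hydropathy scale. Assumption: chosen only for illustration;
# the abstract does not say which four protein properties were used.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def auto_covariance(sequence, scale, max_lag):
    """Return one AC value per lag in 1..max_lag for a single property.

    AC(lag) = 1 / (L - lag) * sum_i (x_i - mean) * (x_{i+lag} - mean),
    where x_i is the property value of residue i and L is the sequence length.
    """
    values = [scale[residue] for residue in sequence]
    length = len(values)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")

    # Centre the property profile on its mean over the whole sequence.
    centre = mean(values)
    centred = [value - centre for value in values]

    features = []
    for lag in range(1, max_lag + 1):
        # Average product of centred values that are `lag` residues apart.
        total = sum(centred[i] * centred[i + lag] for i in range(length - lag))
        features.append(total / (length - lag))
    return features


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from DNAdset or DNAaset.
    print(auto_covariance("MKRLIRAGE", HYDROPATHY, max_lag=3))
```

The output length depends only on `max_lag`, not on the sequence length, which is what lets proteins of different sizes share one feature space before being passed to a classifier such as an SVM. Repeating the call for several properties and concatenating the results would give a longer vector in the same spirit.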
