Prediction of DNA-binding proteins by interaction fusion feature representation and selective ensemble

Abstract: DNA-binding proteins play important roles in many cellular processes, and identifying them is essential for understanding and interpreting protein function. This manuscript presents algorithms for feature representation from primary protein sequences and for selective ensemble classification. We first propose a multi-source interaction fusion feature representation model that simultaneously considers interactions among physicochemical properties, evolutionary information, and gap distances between residues. We then provide a selective ensemble algorithm based on gap distances, which produces diverse base classifiers by selecting different feature subspaces and thereby improves the generalization ability of the integrated classifier. We compare the proposed algorithms with several state-of-the-art methods on multiple datasets. The experimental results show that the proposed algorithms are competitive and identify DNA-binding proteins effectively. The major contributions of this study are a model and algorithm for feature representation that capture interaction effects, and a selective ensemble classification algorithm based on parameter perturbation. The proposed algorithms can also be applied to other biological problems involving amino acid sequences.
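The abstract does not give the formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a single physicochemical property (the Kyte-Doolittle hydropathy scale, chosen here purely for illustration), omits evolutionary information entirely, and uses a hypothetical one-dimensional nearest-centroid rule as the base classifier. Each gap distance defines its own feature subspace; the ensemble keeps only the gaps whose base classifier scores best on a validation set and combines them by majority vote. All function names are hypothetical.

```python
# Illustrative only: property scale, base learner, and selection rule are
# assumptions, not taken from the paper.
KD = {  # Kyte-Doolittle hydropathy values
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}

def gap_interaction(seq, gap):
    """Mean product of property values for residues `gap` positions apart."""
    pairs = [KD[a] * KD[b] for a, b in zip(seq, seq[gap:])]
    return sum(pairs) / len(pairs) if pairs else 0.0

def train_centroid(xs, ys):
    """Fit a 1-D nearest-centroid model; returns (negative, positive) centroids."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return sum(neg) / len(neg), sum(pos) / len(pos)

def predict_centroid(model, x):
    """Assign the label of the closer centroid."""
    c_neg, c_pos = model
    return 1 if abs(x - c_pos) < abs(x - c_neg) else 0

def selective_ensemble(train, valid, gaps, keep=3):
    """Train one base classifier per gap; keep the `keep` most accurate on `valid`.

    `train` and `valid` are lists of (sequence, label) with labels 0 or 1.
    """
    scored = []
    for g in gaps:
        xs = [gap_interaction(s, g) for s, _ in train]
        model = train_centroid(xs, [y for _, y in train])
        hits = sum(predict_centroid(model, gap_interaction(s, g)) == y
                   for s, y in valid)
        scored.append((hits / len(valid), g, model))
    # Stable sort: ties keep the original gap order.
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(g, m) for _, g, m in scored[:keep]]

def predict(ensemble, seq):
    """Majority vote over the selected per-gap classifiers."""
    votes = [predict_centroid(m, gap_interaction(seq, g)) for g, m in ensemble]
    return 1 if 2 * sum(votes) > len(votes) else 0
```

A call such as `selective_ensemble(train, valid, gaps=[1, 2, 3], keep=2)` returns the retained `(gap, model)` pairs, which can then be passed to `predict`. Varying the gap is one simple way to perturb a parameter so that the base classifiers differ from one another.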
