FKRR-MVSF: A Fuzzy Kernel Ridge Regression Model for Identifying DNA-Binding Proteins by Multi-View Sequence Features via Chou’s Five-Step Rule

DNA-binding proteins play an important role in cell metabolism. In biological laboratories, DNA-binding proteins are detected with yeast one-hybrid methods, bacterial one-hybrid methods, X-ray crystallography and other techniques, but these methods are costly in labor, materials and time. In recent years, many computational approaches have been proposed to detect DNA-binding proteins. In this paper, a machine learning-based method, called the Fuzzy Kernel Ridge Regression model based on Multi-View Sequence Features (FKRR-MVSF), is proposed to identify DNA-binding proteins. First, multi-view sequence features are extracted from protein sequences. Next, a Multiple Kernel Learning (MKL) algorithm is employed to combine the multiple features. Finally, a Fuzzy Kernel Ridge Regression (FKRR) model is built to detect DNA-binding proteins. Our method obtains accuracies of 83.26% and 81.72% on two benchmark datasets (PDB1075 and PDB186, respectively), which compares favorably with other methods.
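The three steps above (per-view features, kernel combination, fuzzy kernel ridge regression) can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the uniform kernel weights standing in for the MKL step, the class-centre distance rule for fuzzy memberships, and all function names and parameter values (`gamma`, `lam`, `floor`) are assumptions made for illustration. The fit solves the membership-weighted ridge system $(K + \lambda S^{-1})\alpha = y$, where $S$ is the diagonal matrix of memberships.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian kernel matrix between rows of A and rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def combine_kernels(kernels, weights=None):
    """Convex combination of per-view kernels.

    Assumption: uniform weights stand in for a learned MKL weighting.
    """
    if weights is None:
        weights = np.ones(len(kernels))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))

def fuzzy_memberships(K, y, floor=0.1):
    """Membership in [floor, 1] from kernel-space distance to the class centre.

    Assumption: samples far from their own class centre are treated as
    less reliable and receive a smaller membership.
    """
    s = np.empty(len(y), dtype=float)
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        Kc = K[np.ix_(idx, idx)]
        # Squared distance of each sample to the class mean in feature space.
        d2 = np.diag(Kc) - 2.0 * Kc.mean(axis=1) + Kc.mean()
        d = np.sqrt(np.maximum(d2, 0.0))
        s[idx] = np.maximum(1.0 - d / (d.max() + 1e-12), floor)
    return s

def fit_fkrr(K, y, s, lam=0.1):
    """Solve (K + lam * S^{-1}) alpha = y for membership-weighted ridge regression."""
    return np.linalg.solve(K + lam * np.diag(1.0 / s), y.astype(float))

def predict_fkrr(K_test_train, alpha):
    """Return labels in {+1, -1} from the sign of the regression output."""
    return np.where(K_test_train @ alpha >= 0, 1, -1)
```

In use, one kernel would be computed per feature view, the kernels merged with `combine_kernels`, and the merged kernel passed to `fuzzy_memberships` and `fit_fkrr`. Test samples are scored with the combined kernel between test and training rows.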
