The Feature Compression Algorithms for Identifying Cytokines Based on CNT Features

As signaling proteins, cytokines regulate a wide range of biological functions, so it is important to distinguish them from other kinds of proteins. The 188-dimensional (188D) CNT features have been proposed for identifying cytokines, but they contain many redundant features. In this paper, we propose three feature compression algorithms that remove the redundant features from the 188D set while preserving classification accuracy: a genetic-based algorithm, a greedy-based algorithm, and a brute-force-based algorithm. Experimental results show that the brute-force-based algorithm achieves the highest classification accuracy of the three, and that the genetic-based algorithm yields the smallest compressed feature set. Both, however, take much more time than the greedy-based algorithm. The greedy-based algorithm offers a good trade-off among the three factors considered: classification accuracy, number of compressed features, and time consumption.
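
The abstract does not spell out the greedy procedure, so the following is only a minimal sketch of one common greedy strategy (backward elimination), not the authors' method. The function name `greedy_compress`, the `tolerance` parameter, and the toy scoring function are all hypothetical; in the paper's setting the score would presumably be the classification accuracy obtained with a given subset of the 188D features.

```python
from typing import Callable, FrozenSet, List

# Hypothetical stand-in for "classification accuracy on a feature subset".
# Here only features 0, 3 and 5 carry information; the rest are redundant.
INFORMATIVE = frozenset({0, 3, 5})


def toy_score(subset: FrozenSet[int]) -> float:
    """Fraction of the informative features that are present in `subset`."""
    return len(subset & INFORMATIVE) / len(INFORMATIVE)


def greedy_compress(
    n_features: int,
    score: Callable[[FrozenSet[int]], float],
    tolerance: float = 0.0,
) -> List[int]:
    """Greedy backward elimination (an assumed variant, for illustration).

    Starting from all features, repeatedly drop the single feature whose
    removal leaves the highest score, as long as that score stays within
    `tolerance` of the score of the full feature set.
    """
    kept = set(range(n_features))
    baseline = score(frozenset(kept))

    while len(kept) > 1:
        best_feature, best_score = None, float("-inf")
        # Evaluate every single-feature removal; ties go to the lowest index.
        for feature in sorted(kept):
            candidate_score = score(frozenset(kept - {feature}))
            if candidate_score > best_score:
                best_feature, best_score = feature, candidate_score
        # Stop once even the best removal costs too much accuracy.
        if best_score < baseline - tolerance:
            break
        kept.remove(best_feature)

    return sorted(kept)


if __name__ == "__main__":
    print(greedy_compress(8, toy_score))
```

Each pass evaluates at most one subset per remaining feature, so the number of score evaluations grows quadratically with the feature count. An exhaustive search over all subsets grows exponentially instead, which is consistent with the time advantage the abstract reports for the greedy-based algorithm.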
