ECFS-DEA: an ensemble classifier-based feature selection for differential expression analysis on expression profiles

Background: Various methods for differential expression analysis are widely used to identify the features that best distinguish between categories of samples. Multiple hypothesis testing may leave out explanatory features, each of which may be composed of individually insignificant variables. Multivariate hypothesis testing remains outside the mainstream because of the heavy computational overhead of large-scale matrix operations. Random forest provides a classification strategy for calculating variable importance. However, it may not suit every sample distribution.

Results: Building on the idea of an ensemble classifier, we develop a feature selection tool for differential expression analysis on expression profiles (ECFS-DEA for short). To account for differences in sample distribution, a graphical user interface allows the selection of different base classifiers. Inspired by random forest, we propose a common measure of variable importance that is applicable to any base classifier. After the user interactively selects a feature from the sorted individual variables, a projection heatmap is presented using k-means clustering. An ROC curve is also provided. Together, the heatmap and the ROC curve intuitively demonstrate the effectiveness of the selected feature.

Conclusions: Feature selection through ensemble classifiers helps to select important variables and is therefore applicable to different sample distributions. Experiments on simulated and real data demonstrate the effectiveness of ECFS-DEA for differential expression analysis on expression profiles. The software is available at http://bio-nefu.com/resource/ecfs-dea.
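The abstract does not spell out the importance measure, so the following is only a minimal sketch of one way a random-forest-style measure could be made independent of the base classifier. It assumes the measure is the mean drop in out-of-bag accuracy after permuting a variable, as in Breiman's random forest. The names `ensemble_importance`, `fit_nearest_centroid`, and `make_demo_data`, the nearest-centroid base classifier, and all parameter values are hypothetical and are not taken from ECFS-DEA.

```python
import random

def fit_nearest_centroid(X, y):
    """Stand-in base classifier (an assumption): predict the class whose mean is closest."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return predict

def accuracy(predict, X, y):
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def ensemble_importance(X, y, fit, n_rounds=200, n_sub=2, seed=0):
    """Mean out-of-bag accuracy drop per variable, for any `fit(X, y) -> predict`."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    total, count = [0.0] * p, [0] * p
    for _ in range(n_rounds):
        boot = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]   # out-of-bag samples
        if not oob or len({y[i] for i in boot}) < 2:
            continue                                     # skip degenerate rounds
        feats = rng.sample(range(p), n_sub)              # random variable subset
        predict = fit([[X[i][j] for j in feats] for i in boot],
                      [y[i] for i in boot])
        X_oob = [[X[i][j] for j in feats] for i in oob]
        y_oob = [y[i] for i in oob]
        base_acc = accuracy(predict, X_oob, y_oob)
        for k, j in enumerate(feats):
            column = [row[k] for row in X_oob]
            rng.shuffle(column)                          # break the variable-label link
            permuted = [row[:k] + [v] + row[k + 1:]
                        for row, v in zip(X_oob, column)]
            total[j] += base_acc - accuracy(predict, permuted, y_oob)
            count[j] += 1
    return [t / c if c else 0.0 for t, c in zip(total, count)]

def make_demo_data(n=60, n_noise=4, seed=1):
    """Synthetic profile: variable 0 is shifted between classes, the rest are noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(3.0 * label, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
         for label in y]
    return X, y

if __name__ == "__main__":
    X, y = make_demo_data()
    scores = ensemble_importance(X, y, fit_nearest_centroid)
    for j in sorted(range(len(scores)), key=lambda j: -scores[j]):
        print(f"variable {j}: importance {scores[j]:.3f}")
```

Because the base classifier enters only through the `fit` argument, another classifier can be swapped in by passing a different function with the same `fit(X, y) -> predict` shape, which mirrors the idea of choosing a base classifier to match the sample distribution.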
