A Comparative Study of Feature Selection Methods for Biomarker Discovery

Biomarker discovery from gene expression data is a major area of research. Such data is very large and often has to be classified or clustered with machine learning techniques before further analysis. Feature selection (FS) is an important preprocessing step, and many FS methods have been devised. However, applying different FS techniques to the same dataset does not always produce the same results. This work examines the robustness of FS methods, where robustness is defined as the stability of a given gene pool with respect to the data and the FS method used. Our approach is to investigate the feature subsets obtained when running diverse FS methods on different gene expression datasets. As a first step, 10 FS methods were executed on 2 different datasets. Based on the results obtained, 2 of these methods were further investigated using 10 different datasets. The effect of selecting an increasing number of features on the percentage similarity between methods was also studied. Our results show that the studied methods exhibit high variability in the resulting feature subsets. The selected subsets differed both between methods and, for a single method, across datasets. The reason for this is not clear, and how to objectively assess the ideal (best) subset should be investigated further.
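The abstract does not say how the percentage similarity between methods is computed. As a minimal sketch, assume it is the share of features common to the top-k subsets of two methods, and that each method returns a ranked list of gene identifiers. The function names, method names, and gene labels below are hypothetical and used only for illustration.

```python
from itertools import combinations


def percentage_similarity(subset_a, subset_b):
    """Percentage of shared features, relative to the smaller subset.

    Assumption: this overlap measure stands in for the paper's
    "percentage similarity", whose exact definition is not given.
    """
    a, b = set(subset_a), set(subset_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


def pairwise_similarity(rankings, k):
    """Compare the top-k features of every pair of FS methods.

    rankings maps a method name to its features, best ranked first.
    """
    top_k = {name: ranked[:k] for name, ranked in rankings.items()}
    return {
        (m1, m2): percentage_similarity(top_k[m1], top_k[m2])
        for m1, m2 in combinations(sorted(top_k), 2)
    }


# Hypothetical rankings from two FS methods on the same dataset.
rankings = {
    "method_a": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "method_b": ["g2", "g1", "g7", "g3", "g8", "g4"],
}

# Track how similarity changes as more features are selected.
for k in (2, 4, 6):
    print(k, pairwise_similarity(rankings, k))
```

Running the loop over increasing values of k mirrors the study of how the number of selected features affects agreement between methods. A chance-corrected stability index could replace the raw overlap if subsets of different sizes need to be compared.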
