Ensemble Gene Selection Versus Single Gene Selection: Which Is Better?

One of the major challenges in bioinformatics is selecting the appropriate genes for a given problem and, beyond that, choosing the best gene selection technique for the task. Many such techniques have been developed, each with its own characteristics and complexities. Recent work has addressed this challenge with ensemble gene selection: performing multiple runs of gene selection and aggregating the results into a single final list. The question is whether ensemble gene selection improves on the results obtained with single gene selection techniques (e.g., filter-based gene selection techniques applied on their own, without any ensemble approach). We compare how five filter-based feature (gene) selection techniques perform with and without a data diversity ensemble approach (applying a single feature selection technique to multiple sampled datasets created from the original one) when used to build models that label cancerous cells (or predict cancer treatment response) from gene expression levels. We employ eleven bioinformatics (gene microarray) datasets, four feature subset sizes, and five learners. Our results show that Fold Change Ratio and Information Gain produce better classification results when the ensemble approach is applied, while Probability Ratio and Signal-to-Noise generally perform better without it. For the Area Under the ROC (Receiver Operating Characteristic) Curve ranker, classification results are similar with or without the ensemble approach. To our knowledge, this is the first paper to comprehensively examine the difference between the ensemble and single approaches to gene selection in the biomedical and bioinformatics domains.
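
To make the comparison concrete, the following is a minimal sketch of single versus data diversity ensemble gene selection, not the authors' implementation. The abstract does not specify the sampling scheme or the aggregation method, so bootstrap resampling and mean rank aggregation are assumptions here, as are the function names, the `n_runs` and `seed` parameters, and the Signal-to-Noise definition used (absolute difference of class means divided by the sum of class standard deviations, one common form).

```python
import random
import statistics


def signal_to_noise(values, labels):
    """Score one gene: |mean1 - mean0| / (sd1 + sd0). Assumed S2N form."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = statistics.pstdev(pos) + statistics.pstdev(neg)
    # Small constant avoids division by zero for constant genes.
    return abs(statistics.mean(pos) - statistics.mean(neg)) / (spread + 1e-12)


def rank_genes(X, y):
    """Return one rank per gene (1 = best). X is a list of sample rows."""
    n_genes = len(X[0])
    scores = [signal_to_noise([row[g] for row in X], y) for g in range(n_genes)]
    order = sorted(range(n_genes), key=lambda g: -scores[g])
    ranks = [0] * n_genes
    for position, g in enumerate(order, start=1):
        ranks[g] = position
    return ranks


def single_select(X, y, k):
    """Single approach: rank once on the original data, keep the top k genes."""
    ranks = rank_genes(X, y)
    return sorted(range(len(ranks)), key=lambda g: ranks[g])[:k]


def ensemble_select(X, y, k, n_runs=25, seed=0):
    """Data diversity ensemble: same ranker on resampled data, ranks averaged.

    Bootstrap sampling and mean rank aggregation are assumptions of this sketch.
    """
    if len(set(y)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n_samples, n_genes = len(X), len(X[0])
    totals = [0.0] * n_genes
    runs = 0
    while runs < n_runs:
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        y_s = [y[i] for i in idx]
        if len(set(y_s)) < 2:
            continue  # resample if one class is missing from this draw
        ranks = rank_genes([X[i] for i in idx], y_s)
        totals = [t + r for t, r in zip(totals, ranks)]
        runs += 1
    mean_ranks = [t / n_runs for t in totals]
    return sorted(range(n_genes), key=lambda g: mean_ranks[g])[:k]
```

Both functions return the indices of the selected genes, so the two lists can be fed to the same learner and compared at the same subset size `k`.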
