A review of the stability of feature selection techniques for bioinformatics data

Feature selection is an important step in data mining and is used in various domains including genetics, medicine, and bioinformatics. Choosing the important features (genes) is essential for the discovery of new knowledge hidden within the genetic code as well as the identification of important biomarkers. Although feature selection methods can help sort through large numbers of genes based on their relevance to the problem at hand, the results generated tend to be unstable and thus cannot be reproduced in other experiments. Relatedly, research interest in the stability of feature ranking methods has grown recently and researchers have produced experimental designs for testing the stability of feature selection, creating new metrics for measuring stability and new techniques designed to improve the stability of the feature selection process. In this paper, we will introduce the role of stability in feature selection with DNA microarray data. We list various ways of improving feature ranking stability, and discuss feature selection techniques, specifically explaining ensemble feature ranking and presenting various ensemble feature ranking aggregation methods. Finally, we discuss experimental procedures such as dataset perturbation, fixed overlap partitioning, and cross validation procedures that help researchers analyze and measure the stability of feature ranking methods. Throughout this work, we investigate current research in the field and discuss possible avenues of continuing such research efforts.

[1]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[2]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[3]  Yue Han,et al.  Stable Gene Selection from Microarray Data via Sample Weighting , 2012, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[4]  Pedro Larrañaga,et al.  Filter versus wrapper gene selection approaches in DNA microarray domains , 2004, Artif. Intell. Medicine.

[5]  Taghi M. Khoshgoftaar,et al.  Feature Selection with High-Dimensional Imbalanced Data , 2009, 2009 IEEE International Conference on Data Mining Workshops.

[6]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[7]  Yvan Saeys,et al.  Robust Feature Selection Using Ensemble Feature Selection Techniques , 2008, ECML/PKDD.

[8]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[9]  Peter Kokol,et al.  Stability of Ranked Gene Lists in Large Microarray Analysis Studies , 2010, Journal of biomedicine & biotechnology.

[10]  Anthony Boral,et al.  Gene expression profiling and correlation with outcome in clinical trials of the proteasome inhibitor bortezomib. , 2006, Blood.

[11]  Thibault Helleputte,et al.  Robust biomarker identification for cancer diagnosis with ensemble feature selection methods , 2010, Bioinform..

[12]  Bernard Zenko,et al.  Evaluation Method for Feature Rankings and their Aggregations for Biomarker Discovery , 2009, MLSB.

[13]  Caroline C. Friedel,et al.  Reliable gene signatures for microarray classification: assessment of stability and performance , 2006, Bioinform..

[14]  Richard Baumgartner,et al.  Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions , 2003, Bioinform..

[15]  Taghi M. Khoshgoftaar,et al.  An Empirical Study of Software Metrics Selection Using Support Vector Machine , 2011, SEKE.

[16]  Geoff Holmes,et al.  Benchmarking Attribute Selection Techniques for Discrete Class Data Mining , 2003, IEEE Trans. Knowl. Data Eng..

[17]  Anne-Laure Boulesteix,et al.  Stability and aggregation of ranked gene lists , 2009, Briefings Bioinform..

[18]  Taghi M. Khoshgoftaar,et al.  A novel dataset-similarity-aware approach for evaluating stability of software metric selection techniques , 2012, 2012 IEEE 13th International Conference on Information Reuse & Integration (IRI).

[19]  Huan Liu,et al.  A Dilemma in Assessing Stability of Feature Selection Algorithms , 2011, 2011 IEEE International Conference on High Performance Computing and Communications.

[20]  Jean Yee Hwa Yang,et al.  Gene-gene interaction filtering with ensemble of filters , 2011, BMC Bioinformatics.

[21]  D. DeMets,et al.  Biomarkers and surrogate endpoints: Preferred definitions and conceptual framework , 2001, Clinical pharmacology and therapeutics.

[22]  Nir Friedman,et al.  Tissue classification with gene expression profiles. , 2000 .

[23]  Seon-Young Kim,et al.  Effects of sample size on robustness and prediction accuracy of a prognostic gene signature , 2009, BMC Bioinformatics.

[24]  Albert Y. Zomaya,et al.  A Review of Ensemble Methods in Bioinformatics , 2010, Current Bioinformatics.

[25]  N. Arden,et al.  Quantifying stability in gene list ranking across microarray derived clinical biomarkers , 2011, BMC Medical Genomics.

[26]  Ludmila I. Kuncheva,et al.  A stability index for feature selection , 2007, Artificial Intelligence and Applications.

[27]  P. Cunningham,et al.  Solutions to Instability Problems with Sequential Wrapper-based Approaches to Feature Selection , 2002 .

[28]  Jean-Philippe Vert,et al.  The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures , 2011, PloS one.

[29]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[30]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[31]  Melanie Hilario,et al.  Knowledge and Information Systems , 2007 .

[32]  Lei Liu,et al.  Ensemble gene selection by grouping for microarray data classification , 2010, J. Biomed. Informatics.

[33]  Taghi M. Khoshgoftaar,et al.  Random forest: A reliable tool for patient response prediction , 2011, 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW).

[34]  Lloyd A. Smith,et al.  Feature Selection for Machine Learning: Comparing a Correlation-Based Filter Approach to the Wrapper , 1999, FLAIRS.

[35]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[36]  Chris H. Q. Ding,et al.  Consensus group stable feature selection , 2009, KDD.

[37]  Igor Kononenko,et al.  Estimating Attributes: Analysis and Extensions of RELIEF , 1994, ECML.

[38]  Zengyou He,et al.  Stable Feature Selection for Biomarker Discovery , 2010, Comput. Biol. Chem..

[39]  Shyam Visweswaran,et al.  Measuring Stability of Feature Selection in Biomedical Datasets , 2009, AMIA.

[40]  Taghi M. Khoshgoftaar,et al.  Stability Analysis of Feature Ranking Techniques on Biological Datasets , 2011, 2011 IEEE International Conference on Bioinformatics and Biomedicine.

[41]  Chris H. Q. Ding,et al.  Stable feature selection via dense feature groups , 2008, KDD.