Stable feature selection based on the ensemble L1-norm support vector machine for biomarker discovery

Background: Lately, biomarker discovery has become one of the most significant research issues in the biomedical field. With the advent of high-throughput technologies, genomic data such as microarray and RNA-seq data have become widely available, and many feature selection techniques have been applied to retrieve significant biomarkers from them. However, these datasets tend to be noisy, with high-dimensional features and a small number of samples; thus, conventional feature selection approaches can be problematic in terms of reproducibility.

Results: In this article, we propose a stable feature selection method for high-dimensional datasets. We apply an ensemble L1-norm support vector machine to efficiently reduce irrelevant features while accounting for the stability of each feature. We define a stability score for each feature by aggregating the ensemble results, and then apply backward feature elimination to the feature set purified by this score; an optimal feature set for classification performance can therefore be obtained without setting a specific threshold. The proposed methodology is evaluated by classifying the binary stage of renal clear cell carcinoma from RNA-seq data.

Conclusion: A comparison with established algorithms, namely a fast correlation-based filter, random forest, and an ensemble version of L2-norm support vector machine-based recursive feature elimination, demonstrates that our method is generally superior in terms of both classification performance and stability. The proposed approach also performs moderately well on high-dimensional datasets consisting of a very large number of features and a small number of samples, and is expected to be applicable to many other studies aimed at biomarker discovery.
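The following is a minimal sketch of the kind of pipeline described above: an ensemble of bootstrap-trained L1-norm SVMs produces a per-feature stability score (the fraction of models giving the feature a nonzero weight), and backward elimination over the stability-ranked features keeps removing the least stable feature as long as cross-validated accuracy does not drop. It uses scikit-learn; the ensemble size B, the penalty strength C, the synthetic data, and the use of a linear-kernel SVC as the evaluation classifier are illustrative assumptions, not values or choices taken from the paper.

```python
# Sketch of ensemble L1-norm SVM stability scoring + backward elimination.
# B, C, the synthetic dataset, and the evaluation classifier are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC, SVC
from sklearn.utils import resample

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

B, C = 50, 0.1                      # assumed ensemble size and L1 penalty strength
stability = np.zeros(X.shape[1])    # per-feature selection frequency

for b in range(B):
    Xb, yb = resample(X, y, random_state=b)          # bootstrap sample
    svm = LinearSVC(penalty="l1", dual=False, C=C, max_iter=5000)
    svm.fit(Xb, yb)
    stability += (np.abs(svm.coef_).ravel() > 0)     # selected if weight is nonzero

stability /= B                                       # stability score in [0, 1]

# Purify: keep only features that were selected at least once, ranked by stability.
candidates = np.argsort(stability)[::-1]
candidates = candidates[stability[candidates] > 0]

# Backward elimination: drop the least stable feature while CV accuracy holds up.
best_feats = list(candidates)
best_score = cross_val_score(SVC(kernel="linear"), X[:, best_feats], y, cv=5).mean()
for feat in reversed(candidates.tolist()):           # least stable first
    trial = [f for f in best_feats if f != feat]
    if not trial:
        break
    score = cross_val_score(SVC(kernel="linear"), X[:, trial], y, cv=5).mean()
    if score >= best_score:
        best_feats, best_score = trial, score

print(f"{len(best_feats)} features kept, CV accuracy {best_score:.3f}")
```

Because elimination stops when accuracy would degrade, the final feature set is determined by performance rather than by a hand-picked stability cutoff, which reflects the threshold-free selection described in the abstract.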
