Sparse and stable gene selection with consensus SVM-RFE

We describe a method for sparse and stable gene selection that combines a number of unstable but low-cost SVM-RFE units, referred to as SVM-RFE subunits. Using a comprehensive simulation study, we show that introducing a consensus constraint with respect to variations in the gene removal policy, together with a stability constraint with respect to perturbations in the training data, can markedly improve the gene selection precision, dimensionality reduction ratio, and stability of low-cost SVM-RFE subunits while keeping computational costs affordable. The method, which does not require the number of selected genes to be fixed in advance, is divided into two stages. In the first stage (spreading), multiple rough gene removal policies are applied to multiple surrogate training datasets. In the second stage (despreading), multiple consensus gene sets with respect to variations in the gene removal policy are obtained and passed through a stability filter, which selects the best-performing gene set. Hence, while the consensus constraint performs strong dimensionality reduction at affordable computational cost, the stability constraint ensures acceptable levels of gene selection stability and further dimensionality reduction. The method is validated on three benchmark microarray datasets.
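The two-stage structure can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details that the abstract leaves open are assumptions made here for illustration: each removal policy is modelled as a fixed fraction of genes dropped per RFE iteration, surrogate datasets are random 80% subsamples, each subunit returns the smallest gene subset on its elimination path with the lowest training error, consensus is taken as the set intersection across policies, and the stability filter keeps the consensus set with the highest mean Jaccard similarity to the others. The linear SVM is a small Pegasos-style subgradient solver written in pure Python, and all function names are hypothetical.

```python
import random

def train_linear_svm(X, y, genes, epochs=20, lam=0.01, seed=0):
    """Pegasos-style subgradient descent for a linear SVM restricted to `genes`."""
    rng = random.Random(seed)
    w = {g: 0.0 for g in genes}
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(w[g] * X[i][g] for g in genes)
            for g in genes:
                w[g] *= 1.0 - eta * lam          # shrink (regularisation step)
                if margin < 1.0:                 # hinge loss is active
                    w[g] += eta * y[i] * X[i][g]
    return w

def error_rate(X, y, w):
    wrong = sum(1 for xi, yi in zip(X, y)
                if yi * sum(wg * xi[g] for g, wg in w.items()) <= 0)
    return wrong / len(X)

def svm_rfe_subunit(X, y, drop_fraction):
    """One low-cost SVM-RFE run; `drop_fraction` is the (assumed) removal policy."""
    genes = list(range(len(X[0])))
    best_genes, best_err = set(genes), float("inf")
    while True:
        w = train_linear_svm(X, y, genes)
        err = error_rate(X, y, w)
        if err <= best_err:                      # ties favour the smaller subset
            best_genes, best_err = set(genes), err
        if len(genes) == 1:
            return best_genes
        n_drop = min(len(genes) - 1, max(1, int(drop_fraction * len(genes))))
        # Standard SVM-RFE criterion: drop the genes with the smallest w_g^2.
        ranked = sorted(genes, key=lambda g: w[g] ** 2, reverse=True)
        genes = ranked[:len(genes) - n_drop]

def jaccard(a, b):
    return 1.0 if not (a or b) else len(a & b) / len(a | b)

def consensus_svm_rfe(X, y, policies=(0.5, 0.3, 0.1), n_surrogates=5, seed=0):
    rng = random.Random(seed)
    consensus_sets = []
    # Spreading: every removal policy is applied to every surrogate dataset.
    for _ in range(n_surrogates):
        idx = rng.sample(range(len(X)), int(0.8 * len(X)))
        Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
        subunit_sets = [svm_rfe_subunit(Xs, ys, p) for p in policies]
        # Consensus across policies (assumed here to be the intersection).
        consensus_sets.append(set.intersection(*subunit_sets))
    # Despreading: assumed stability filter based on mean Jaccard similarity.
    def mean_similarity(k):
        sims = [jaccard(consensus_sets[k], s)
                for j, s in enumerate(consensus_sets) if j != k]
        return sum(sims) / len(sims)
    best = max(range(n_surrogates), key=mean_similarity)
    return consensus_sets[best], consensus_sets

def make_toy_data(n_samples=40, n_genes=12, n_informative=2, seed=1):
    """Synthetic two-class data; only the first `n_informative` genes carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = 1 if i % 2 == 0 else -1
        X.append([rng.gauss(1.5 * label if g < n_informative else 0.0, 1.0)
                  for g in range(n_genes)])
        y.append(label)
    return X, y
```

Calling `consensus_svm_rfe(*make_toy_data())` returns the selected gene set together with the per-surrogate consensus sets, so the effect of the stability filter can be inspected. Note that the number of selected genes is never passed in: it emerges from the subunit stopping rule and the intersection, which mirrors the abstract's claim only under the assumptions stated above.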
