Development of Two-Stage SVM-RFE Gene Selection Strategy for Microarray Expression Data Analysis

Extracting a subset of informative genes from microarray expression data is a critical data preparation step in cancer classification and other biological function analyses. Among the many algorithms developed for this task, the Support Vector Machine - Recursive Feature Elimination (SVM-RFE) algorithm is one of the best gene feature selection algorithms. It is usually assumed that a smaller "filter-out" factor in the SVM-RFE, which eliminates fewer gene features in each recursion, leads to extraction of a better gene subset. Our simulations show that this assumption is not always correct: the SVM-RFE is highly sensitive to the "filter-out" factor, which makes it an unstable algorithm. To select a set of key gene features for reliable prediction of cancer types or subtypes and for other applications, we have developed a new two-stage SVM-RFE algorithm. The first stage is designed to eliminate most of the irrelevant, redundant and noisy genes while keeping information loss small. The second stage then performs a fine selection to obtain the final gene subset. The two-stage SVM-RFE overcomes the instability problem of the SVM-RFE and so achieves better algorithm utility. In our analysis of three publicly available microarray expression datasets, the two-stage SVM-RFE was significantly more accurate and more reliable than the SVM-RFE and three correlation-based methods. Furthermore, the two-stage SVM-RFE is computationally efficient: its time complexity is $O(d \log_2 d)$, where $d$ is the size of the original gene set.
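The role of the "filter-out" factor and the coarse-then-fine structure can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how either stage is configured, so the sketch assumes that stage one uses a single fractional filter-out factor and that stage two removes one gene per recursion. The small subgradient-descent trainer is a stand-in for a real SVM solver, and all function names, parameters and the toy data are hypothetical.

```python
import random


def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    Stand-in for a real SVM solver (assumption); returns one weight per feature.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # sample violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe(X, y, genes, filter_out, stop_at):
    """Recursively drop the lowest-ranked genes until `stop_at` remain.

    `filter_out` in [0, 1) is the fraction of surviving genes removed per
    recursion; at least one gene is removed each time. Returns (genes, rounds).
    """
    genes = list(genes)
    rounds = 0
    while len(genes) > stop_at:
        sub = [[row[g] for g in genes] for row in X]
        w = train_linear_svm(sub, y)
        # Rank by squared weight, the usual SVM-RFE criterion.
        order = sorted(range(len(genes)), key=lambda j: w[j] ** 2)
        n_drop = max(1, int(filter_out * len(genes)))
        n_drop = min(n_drop, len(genes) - stop_at)  # never overshoot the target
        dropped = set(order[:n_drop])
        genes = [g for j, g in enumerate(genes) if j not in dropped]
        rounds += 1
    return genes, rounds


def two_stage_svm_rfe(X, y, coarse_factor, intermediate_size, final_size):
    """Coarse pruning to `intermediate_size`, then one-gene-at-a-time RFE."""
    all_genes = range(len(X[0]))
    kept, coarse_rounds = svm_rfe(X, y, all_genes, coarse_factor, intermediate_size)
    # filter_out=0.0 hits the max(1, ...) floor: exactly one gene per recursion.
    final, fine_rounds = svm_rfe(X, y, kept, 0.0, final_size)
    return final, coarse_rounds, fine_rounds


# Toy data (made up): 12 samples, 32 "genes", genes 0 and 1 carry the signal.
random.seed(0)
y = [1 if i % 2 == 0 else -1 for i in range(12)]
X = [[random.gauss(0, 1) for _ in range(32)] for _ in y]
for row, label in zip(X, y):
    row[0] += 3 * label
    row[1] -= 3 * label

selected, coarse_rounds, fine_rounds = two_stage_svm_rfe(X, y, 0.5, 8, 2)
print(selected, coarse_rounds, fine_rounds)
```

With a filter-out factor of 0.5 the coarse stage halves the gene set on every recursion, so it needs only about $\log_2$ of the reduction ratio in recursions (two here, from 32 genes to 8). The fine stage then spends one recursion per removed gene, but only on the small surviving set.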
