ofw: An R Package to Select Continuous Variables for Multiclass Classification with a Stochastic Wrapper Method

In high-dimensional, low-sample-size data, feature selection is often needed to reduce the dimension of the variable space while optimizing the classification task. Few tools exist for selecting variables in such data sets, especially when there are more than two classes. We have developed ofw, an R package that implements the meta-algorithm "optimal feature weighting" in the context of classification. We focus on microarray data, although the method can be applied to any p >> n problem with continuous variables. The aim is to select relevant variables and to numerically evaluate the resulting variable selection. Two versions are proposed, applying supervised multiclass classifiers such as classification and regression trees and support vector machines. Furthermore, a weighted approach can be chosen to deal with unbalanced classes, a common characteristic of microarray data sets.
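The general idea of a stochastic wrapper with feature weights can be illustrated with a small sketch. This is not the ofw package API and not the optimal feature weighting update itself. It is a simplified heuristic written under several assumptions: all function names are hypothetical, a nearest-centroid rule stands in for the tree and support vector machine classifiers, and a multiplicative reward replaces the algorithm's actual update. The sketch keeps a probability weight per variable, repeatedly draws a small subset of variables according to those weights, scores the subset on out-of-bag samples, and shifts weight towards subsets that beat chance. The `weighted` option averages per-class error rates so that small classes count as much as large ones.

```python
import math
import random


def subset_error(X, y, feats, train, test, weighted=True):
    """Nearest-centroid error on `test`, using only the columns in `feats`.

    With weighted=True the result is the mean of per-class error rates,
    so small classes count as much as large ones.
    """
    centroids = {}
    for c in set(y[i] for i in train):
        rows = [X[i] for i in train if y[i] == c]
        centroids[c] = [sum(r[f] for r in rows) / len(rows) for f in feats]

    wrong, seen = {}, {}
    for i in test:
        pred = min(
            centroids,
            key=lambda c: sum((X[i][f] - m) ** 2 for f, m in zip(feats, centroids[c])),
        )
        seen[y[i]] = seen.get(y[i], 0) + 1
        wrong[y[i]] = wrong.get(y[i], 0) + (pred != y[i])
    if weighted:
        return sum(wrong[c] / seen[c] for c in seen) / len(seen)
    return sum(wrong.values()) / len(test)


def sample_features(rng, weights, k):
    """Draw k distinct feature indices with probability proportional to weights."""
    pool = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        f = rng.choices(pool, weights=[weights[j] for j in pool], k=1)[0]
        chosen.append(f)
        pool.remove(f)
    return chosen


def stochastic_feature_weights(X, y, n_iter=300, k=2, step=0.5, weighted=True, seed=0):
    """Heuristic stochastic wrapper (an illustration, not the ofw update rule)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    chance = 1.0 - 1.0 / len(set(y))  # error of a label-independent guess
    weights = [1.0 / p] * p
    for _ in range(n_iter):
        train = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(train)
        test = [i for i in range(n) if i not in in_bag]  # out-of-bag samples
        if not test:
            continue
        feats = sample_features(rng, weights, k)
        err = subset_error(X, y, feats, train, test, weighted)
        gain = math.exp(step * (chance - err))  # reward subsets that beat chance
        for f in feats:
            weights[f] *= gain
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


# Toy data (made up for illustration): 3 unbalanced classes, 6 continuous
# variables, and only variable 0 carries class information.
data_rng = random.Random(1)
X, y = [], []
for label, size in {0: 12, 1: 8, 2: 4}.items():
    for _ in range(size):
        row = [data_rng.gauss(0, 1) for _ in range(6)]
        row[0] += 10 * label
        X.append(row)
        y.append(label)

weights = stochastic_feature_weights(X, y)
ranking = sorted(range(len(weights)), key=lambda f: -weights[f])
print("variables ranked by weight:", ranking)
```

On data like this, the informative variable should accumulate weight over the iterations, and the final weights give a ranking from which a selection can be read off. Evaluating that selection fairly would require samples not used during the weighting.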
