Multiclass classification and gene selection with a stochastic algorithm

Microarray technology allows for the monitoring of thousands of gene expressions in various biological conditions, but most of these genes are irrelevant for classifying these conditions. Feature selection is consequently needed to help reduce the dimension of the variable space. Starting from the application of the stochastic meta-algorithm ''Optimal Feature Weighting'' (OFW) for selecting features in various classification problems, focus is made on the multiclass problem that wrapper methods rarely handle. From a computational point of view, one of the main difficulties comes from the unbalanced classes situation that is commonly encountered in microarray data. From a theoretical point of view, very few methods have been developed so far to minimize the classification error made on the minority classes. The OFW approach is developed to handle multiclass problems using CART and one-vs-one SVM classifiers. Comparisons are made with other multiclass selection algorithms such as Random Forests and the filter method F-test on five public microarray data sets with various complexities. Statistical relevancy of the gene selections is assessed by computing the performances and the stability of these different approaches and the results obtained show that the two proposed approaches are competitive and relevant to selecting genes classifying the minority classes. Application to a pig folliculogenesis study follows and a detailed interpretation of the genes that were selected shows that the OFW approach answers the biological question.

[1]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[2]  Jason Weston,et al.  Support vector machines for multi-class pattern recognition , 1999, ESANN.

[3]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[4]  R. Tibshirani,et al.  Improvements on Cross-Validation: The 632+ Bootstrap Method , 1997 .

[5]  T. Poggio,et al.  Multiclass cancer diagnosis using tumor gene expression signatures , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[6]  Vladimir Cherkassky,et al.  The Nature Of Statistical Learning Theory , 1997, IEEE Trans. Neural Networks.

[7]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[8]  Sounak Chakraborty,et al.  Computational Statistics and Data Analysis Simultaneous Cancer Classification and Gene Selection with Bayesian Nearest Neighbor Method: an Integrated Approach , 2022 .

[9]  Chao Chen,et al.  Using Random Forest to Learn Imbalanced Data , 2004 .

[10]  R. Tibshirani,et al.  Diagnosis of multiple cancer types by shrunken centroids of gene expression , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[11]  Jason Weston Leave-One-Out Support Vector Machines , 1999, IJCAI.

[12]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[13]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[14]  Roger E Bumgarner,et al.  Correction: Multiclass classification of microarray data with repeated measurements: application to cancer , 2006, Genome Biology.

[15]  K. Sikora,et al.  Androgens and FSH affect androgen receptor and aromatase distribution in the porcine ovary. , 2003, Folia biologica.

[16]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[17]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[18]  Jaques Reifman,et al.  Gene selection for multiclass prediction of microarray data , 2003, Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003.

[19]  Tao Li,et al.  A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression , 2004, Bioinform..

[20]  Philippe Besse,et al.  Selection of Biologically Relevant Genes with a Wrapper Stochastic Algorithm , 2007, Statistical applications in genetics and molecular biology.

[21]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[22]  Stephen S. Wilson,et al.  Random iterative models , 1996 .

[23]  Harold J. Kushner,et al.  wchastic. approximation methods for constrained and unconstrained systems , 1978 .

[25]  Bernhard Schölkopf,et al.  Use of the Zero-Norm with Linear Models and Kernel Methods , 2003, J. Mach. Learn. Res..

[26]  Gary Weiss,et al.  Does cost-sensitive learning beat sampling for classifying rare classes? , 2005, UBDM '05.

[27]  Yufeng Liu,et al.  Adaptive Weighted Learning for Unbalanced Multicategory Classification , 2009, Biometrics.

[28]  Kim-Anh Lê Cao,et al.  ofw: An R Package to Select Continuous Variables for Multiclass Classification with a Stochastic Wrapper Method , 2008 .

[29]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[30]  S. Sathiya Keerthi,et al.  Which Is the Best Multiclass SVM Method? An Empirical Study , 2005, Multiple Classifier Systems.

[31]  Johannes Grotendorst,et al.  Classification of Highly Unbalanced CYP450 Data of Drugs Using Cost Sensitive Machine Learning Techniques. , 2007 .

[32]  Laurent Younes,et al.  A Stochastic Algorithm for Feature Selection in Pattern Recognition , 2007, J. Mach. Learn. Res..

[33]  Yoonkyung Lee,et al.  Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data , 2003, Bioinform..

[34]  T. Poggio,et al.  Prediction of central nervous system embryonal tumour outcome based on gene expression , 2002, Nature.

[35]  Geoffrey J McLachlan,et al.  Selection bias in gene extraction on the basis of microarray gene-expression data , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[36]  A. Bonnet,et al.  In vivo gene expression in granulosa cells during pig terminal follicular development. , 2008, Reproduction.

[37]  Chun-Xia Zhang,et al.  Using Boosting to prune Double-Bagging ensembles , 2009, Comput. Stat. Data Anal..