Win percentage: a novel measure for assessing the suitability of machine classifiers for biological problems

Selecting an appropriate classifier for a particular biological application is a difficult problem for researchers and practitioners alike. We propose a novel measure, called "win percentage," for assessing the suitability of machine classifiers for a particular problem. We define win percentage as the probability that a classifier will perform better than its peers on a finite random sample of feature sets, with each classifier given equal opportunity to find suitable features. We illustrate the utility of this method using synthetic data. We then evaluate six classifiers on eight microarray datasets representing three diseases: breast cancer, multiple myeloma, and neuroblastoma. Fundamentally, we show that the selection of the most suitable classifier (i.e., one that is more likely to perform better than its peers) depends not only on the dataset and application but also on the thoroughness of feature selection. Win percentage provides a single measurement that could assist users in eliminating or selecting classifiers for their particular application, and it will be accessible from www.biomiblab.org.
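As a rough illustration of the definition above, the following Python sketch estimates win percentage by simulation. It is not the authors' implementation. It assumes that each classifier already has a pool of performance scores, one per candidate feature set, that "equal opportunity" means every classifier draws the same number of feature sets (`n_draws`), and that a classifier "wins" a trial when its best drawn score is the highest. The function name `win_percentage`, its parameters, the tie-splitting rule, and the toy score pools are all hypothetical.

```python
import random


def win_percentage(scores, n_draws, n_trials=10000, seed=0):
    """Estimate, by simulation, how often each classifier beats its peers.

    scores   : dict mapping classifier name -> list of performance scores,
               one score per candidate feature set (assumed precomputed).
    n_draws  : number of feature sets each classifier may try per trial
               (a stand-in for the thoroughness of feature selection).
    n_trials : number of simulated trials.
    """
    rng = random.Random(seed)
    wins = {name: 0.0 for name in scores}
    for _ in range(n_trials):
        # Each classifier samples n_draws feature sets (with replacement)
        # and keeps the best score it found.
        best = {
            name: max(rng.choices(pool, k=n_draws))
            for name, pool in scores.items()
        }
        top = max(best.values())
        winners = [name for name, value in best.items() if value == top]
        # Assumption: ties are split equally among the tied classifiers.
        for name in winners:
            wins[name] += 1.0 / len(winners)
    return {name: count / n_trials for name, count in wins.items()}


# Toy pools (made up): "steady" always scores 0.70; "spiky" usually
# scores 0.60 but reaches 0.90 on one feature set in ten.
pools = {"steady": [0.70] * 10, "spiky": [0.60] * 9 + [0.90]}
quick = win_percentage(pools, n_draws=1)
thorough = win_percentage(pools, n_draws=20)
print(quick, thorough)
```

In this toy setting, "steady" tends to win when only one feature set is tried, while "spiky" tends to win when twenty are tried, which mirrors the abstract's point that the most suitable classifier depends on the thoroughness of feature selection.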

[12]  May D. Wang,et al.  Win percentage: a novel measure for assessing the suitability of machine classifiers for biological problems , 2011, BCB '11.

[13]  Yongsheng Huang,et al.  A validated gene expression model of high-risk multiple myeloma is defined by deregulated expression of genes mapping to chromosome 1. , 2006, Blood.

[14]  Patrick Warnat,et al.  Customized oligonucleotide microarray gene expression-based classification of neuroblastoma patients outperforms current clinical risk stratification. , 2006, Journal of clinical oncology : official journal of the American Society of Clinical Oncology.

[15]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[16]  R. Tibshirani,et al.  Improvements on Cross-Validation: The 632+ Bootstrap Method , 1997 .

[17]  H. Harter Expected values of normal order statistics , 1961 .

[18]  Hakan Ferhatosmanoglu,et al.  Relationship preserving feature selection for unlabelled clinical trials time-series , 2010, BCB '10.

[19]  Ellis Horowitz,et al.  Computer Algorithms / C++ , 2007 .

[20]  Huan Liu,et al.  Feature Selection for Classification , 1997, Intell. Data Anal..

[21]  Huan Liu,et al.  Feature Selection and Classification - A Probabilistic Wrapper Approach , 1996, IEA/AIE.