Stable classification with applications to microarray data

Abstract: A stable classification method called minimum-error-distance threshold (MEDT) with variable selection is developed for the two-class prediction (classification) problem. First, a set of “significant” variables (genes) associated with the two classes is selected using the Wilcoxon rank-sum test. Then a data-driven cutoff point for a distance-based classification algorithm is determined by minimizing a combination of the false-positive and false-negative rates, both estimated by leave-one-out cross-validation. This cutoff point is used to classify a given test set based on the selected variables. The proposed methodology is applied to the leukemia data set analyzed in Golub et al. (Science 286 (1999) 531). To compare the proposed methodology with two existing discrimination methods, diagonal linear discriminant analysis and nearest-neighbor classifiers, 1000 cross-validations are performed. In each, the data set is randomly split into a training set consisting of 32 patients with acute lymphoblastic leukemia (ALL) and 16 with acute myeloid leukemia (AML), and a test set consisting of 15 patients with ALL and nine with AML. Performance summaries are calculated. A simulation study is conducted to demonstrate the superior stability of MEDT compared with that of these existing methods. The stability measure used is the mean-to-standard-deviation ratio of the number of correct predictions.
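The random split and the stability measure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `stratified_split` and `stability_ratio` are hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the abstract does not say which estimator is used.

```python
import random
import statistics


def stratified_split(labels, n_train_per_class, rng):
    """Randomly split sample indices into training and test sets.

    labels: list of class labels, one per sample.
    n_train_per_class: dict mapping each class label to the number of
        samples of that class placed in the training set; the remaining
        samples of that class go to the test set.
    rng: a random.Random instance, so that splits are reproducible.
    """
    train, test = [], []
    for cls, n_train in n_train_per_class.items():
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return sorted(train), sorted(test)


def stability_ratio(correct_counts):
    """Mean-to-standard-deviation ratio of the number of correct predictions.

    correct_counts: number of correctly classified test samples in each
        repeated split (at least two values).
    Assumption: the sample standard deviation (divisor n - 1) is used.
    Returns infinity when every split gives the same count.
    """
    mean = statistics.fmean(correct_counts)
    sd = statistics.stdev(correct_counts)
    if sd == 0:
        return float("inf")
    return mean / sd
```

With 47 ALL and 25 AML labels, calling `stratified_split(labels, {"ALL": 32, "AML": 16}, random.Random(0))` yields 48 training indices and 24 test indices (15 ALL, nine AML), matching the split sizes stated above. Repeating the split, recording each classifier's number of correct test predictions, and passing those counts to `stability_ratio` gives one stability value per classifier; a larger ratio means accuracy that is high relative to how much it varies from split to split.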