A modified area under the ROC curve and its application to marker selection and classification

Abstract: The area under the ROC curve (AUC) can be interpreted as the probability that the classification score of a diseased subject is larger than that of a non-diseased subject for a randomly sampled pair of subjects. From the perspective of classification, we want to separate the two groups as distinctly as possible via the AUC. When the difference between the scores of a marker is small, its impact on classification is less important. We therefore propose a new diagnostic/classification measure based on a modified area under the ROC curve (mAUC), defined as a weighted sum of two AUCs in which the AUC with the smaller difference is assigned a lower weight, and vice versa. The mAUC is robust in the sense that it gets larger as the AUC gets larger, as long as they are not equal. Moreover, in many diagnostic situations, only a specific range of specificity is of interest. Under normal distributions, we show that if the AUCs of two markers are within similar ranges, the larger mAUC implies the larger partial AUC for a given specificity. This property of the mAUC helps to identify the marker with the higher partial AUC, even when the AUCs are similar. Two nonparametric estimates of the mAUC and their variances are given. We also suggest using the mAUC as the objective function for classification, and the gradient Lasso algorithm for classifier construction and marker selection. Applications to simulated datasets and real microarray gene expression datasets show that our method finds a linear classifier with a higher ROC curve than some other existing linear classifiers, especially in the range of low false positive rates.
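The probabilistic reading of the AUC given in the abstract can be illustrated with a minimal sketch of the standard nonparametric (Mann-Whitney) estimate: the fraction of diseased/non-diseased pairs in which the diseased subject scores higher, with ties counted as one half. This is not the paper's mAUC, whose exact weights are not stated in the abstract; the function name `empirical_auc` and the example scores are hypothetical and chosen only for illustration.

```python
def empirical_auc(diseased, healthy):
    """Estimate P(score of diseased > score of non-diseased) over all pairs.

    Ties contribute 1/2, as in the usual Mann-Whitney form of the AUC.
    """
    total = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                total += 1.0
            elif d == h:
                total += 0.5
    return total / (len(diseased) * len(healthy))


# Hypothetical classification scores, for illustration only.
diseased_scores = [0.9, 0.8, 0.4]
healthy_scores = [0.7, 0.3, 0.2]

auc = empirical_auc(diseased_scores, healthy_scores)
print(auc)  # 8 of the 9 pairs are correctly ordered
```

Each pair contributes equally here regardless of how far apart the two scores are, which is the behaviour the abstract's weighted sum is meant to change.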
