A boosting method for maximizing the partial area under the ROC curve

Background

The receiver operating characteristic (ROC) curve is a fundamental tool for assessing the discriminant performance of not only a single marker but also a score function that combines multiple markers. The area under the ROC curve (AUC) of a score function measures its intrinsic ability to discriminate between controls and cases. Recently, the partial AUC (pAUC) has received more attention than the AUC, because it can focus on a range of the false positive rate suited to the clinical situation at hand. However, existing pAUC-based methods handle only a few markers and do not take nonlinear combinations of markers into consideration.

Results

We have developed a new statistical method that focuses on the pAUC, based on a boosting technique. The boosting algorithm combines the markers componentwise to maximize the pAUC, using natural cubic splines or decision stumps (single-level decision trees) depending on whether a marker's values are continuous or discrete. We show that the resulting score plots are useful for understanding how each marker is associated with the outcome variable. We compare the performance of the proposed boosting method with that of existing methods and demonstrate its utility on real data sets. The proposed method achieves much better discrimination performance in terms of the pAUC in both the simulation studies and the real data analysis.

Conclusions

The proposed method addresses how to combine markers after a pAUC-based filtering procedure in high-dimensional settings. Hence, it provides a consistent way of analyzing data based on the pAUC, from marker selection to marker combination, for discrimination problems. The method can capture not only linear but also nonlinear associations between the outcome variable and the markers; such nonlinearity is known to be necessary in general for maximizing the pAUC. The method also emphasizes both classification accuracy and interpretability of the association, by offering simple and smooth score plots for each marker.
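
To make the quantity being maximized concrete, the following is a minimal sketch of the empirical pAUC, that is, the area under the ROC curve restricted to a chosen false positive rate (FPR) range. It is an illustration only and not the boosting algorithm proposed in the paper. The function name `partial_auc`, the default FPR range of 0 to 0.1, and the convention that cases are labelled 1 and tend to have higher scores are all assumptions made for this example.

```python
import numpy as np

def partial_auc(scores, labels, fpr_low=0.0, fpr_high=0.1):
    """Empirical pAUC: area under the ROC curve for FPR in [fpr_low, fpr_high].

    Assumes labels are 1 for cases and 0 for controls, and that higher
    scores indicate cases (illustrative conventions, not from the paper).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Sweep the threshold from the highest score down to the lowest.
    order = np.argsort(-scores, kind="mergesort")
    scores, labels = scores[order], labels[order]
    true_pos = np.cumsum(labels == 1)
    false_pos = np.cumsum(labels == 0)

    # Keep one ROC point per distinct score so ties are handled together.
    last_of_tie = np.r_[np.diff(scores) != 0, True]
    tpr = np.r_[0.0, true_pos[last_of_tie] / true_pos[-1]]
    fpr = np.r_[0.0, false_pos[last_of_tie] / false_pos[-1]]

    # Integrate each ROC segment, clipped to the requested FPR range.
    area = 0.0
    for i in range(len(fpr) - 1):
        x0, x1 = fpr[i], fpr[i + 1]
        if x1 <= x0:
            continue  # vertical step: zero width, no area
        a, b = max(x0, fpr_low), min(x1, fpr_high)
        if b <= a:
            continue  # segment lies outside the range
        slope = (tpr[i + 1] - tpr[i]) / (x1 - x0)
        ya = tpr[i] + slope * (a - x0)
        yb = tpr[i] + slope * (b - x0)
        area += 0.5 * (ya + yb) * (b - a)
    return area

# Usage: pAUC over FPR in [0, 0.5] and the full AUC for a toy score.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
pauc_half = partial_auc(toy_scores, toy_labels, 0.0, 0.5)
auc_full = partial_auc(toy_scores, toy_labels, 0.0, 1.0)
```

Setting the range to the full interval from 0 to 1 recovers the ordinary AUC, which shows how the pAUC narrows attention to the part of the ROC curve that matters for a given clinical setting.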
