A boosting method for maximization of the area under the ROC curve

We discuss the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) for binary classification problems in clinical fields. We propose a statistical method for combining multiple feature variables, based on a boosting algorithm that maximizes the AUC. In this iterative procedure, simple classifiers built from the feature variables are combined flexibly into a single strong classifier. To prevent overfitting, we regularize the algorithm with a penalty term for nonsmoothness. This regularization not only improves classification performance but also clarifies how each feature variable is related to the binary outcome variable. We demonstrate the usefulness of score plots constructed componentwise by the boosting method, and we illustrate the utility of the method with two simulation studies and a real data analysis.
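As a point of reference for the quantity being maximized, the sketch below computes the empirical AUC of a score as the fraction of (positive, negative) pairs that the score ranks correctly, with ties counted as one half. This is the standard Mann-Whitney form of the AUC, not the boosting algorithm itself. The function names, the linear `combined_score`, and the toy data are assumptions made for illustration only; the abstract does not specify the form of the combined classifier or the penalty.

```python
from itertools import product


def empirical_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one score in each class")
    total = 0.0
    for sp, sn in product(scores_pos, scores_neg):
        if sp > sn:
            total += 1.0
        elif sp == sn:
            total += 0.5
    return total / (len(scores_pos) * len(scores_neg))


def combined_score(x, weights):
    """Hypothetical linear combination of feature values (illustration only)."""
    return sum(w * xi for w, xi in zip(weights, x))


# Toy data (invented): two feature variables per subject.
diseased = [(2.0, 1.0), (1.5, 3.0), (0.5, 2.5)]
healthy = [(1.0, 0.5), (2.5, 0.0), (0.0, 1.0)]

# AUC of the first feature alone versus an equally weighted combination.
auc_single = empirical_auc(
    [combined_score(x, (1.0, 0.0)) for x in diseased],
    [combined_score(x, (1.0, 0.0)) for x in healthy],
)
auc_combined = empirical_auc(
    [combined_score(x, (1.0, 1.0)) for x in diseased],
    [combined_score(x, (1.0, 1.0)) for x in healthy],
)
print(auc_single, auc_combined)
```

On this toy data the first feature alone ranks 5 of the 9 pairs correctly, while the equally weighted combination ranks all 9 correctly, which is the kind of gain a method for combining feature variables aims for.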
