Prediction‐Based Structured Variable Selection through the Receiver Operating Characteristic Curves

In many clinical settings, a common problem is to assess the accuracy of a screening test for early detection of a disease. In these applications, the predictive performance of the test is of interest. Variable selection may be useful in designing a medical test. An example is a research study conducted to design a new screening test by selecting variables from an existing screener with a hierarchical structure among variables: there are several root questions, each followed by its stem questions. The stem questions are asked only after a subject has answered the root question. It is therefore unreasonable to select a model that contains stem variables but not their root variable. In this work, we propose methods to perform variable selection with structured variables when the predictive accuracy of a diagnostic test is the main concern of the analysis. We take a linear combination of individual variables to form a combined test. We then maximize a direct summary measure of the predictive performance of the test, the area under the receiver operating characteristic curve (AUC of the ROC), subject to a penalty function to control for overfitting. Since maximizing the empirical AUC of a combined test is a complicated nonconvex problem (Pepe, Cai, and Longton, 2006, Biometrics 62, 221-229), we explore the connection between the empirical AUC and a support vector machine (SVM). We cast the problem of maximizing the predictive performance of a combined test as a penalized SVM problem and apply a reparametrization to impose the hierarchical structure among variables. We also describe a penalized logistic regression variable selection procedure for structured variables and compare it with the ROC-based approaches. We use simulation studies based on real data to examine the performance of the proposed methods. Finally, we apply the developed methods to design a structured screener to be used in primary care clinics to refer potentially psychotic patients for further specialty diagnostics and treatment.
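The link between the empirical AUC and an SVM-type objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, and the product-form reparametrization (stem coefficient = root factor times stem factor) are assumptions made here for illustration only. The empirical AUC of a linear score counts correctly ordered (case, control) pairs, which is nonconvex in the coefficients; replacing the pairwise indicator with a hinge loss on score differences gives a convex surrogate of the kind an SVM minimizes.

```python
from itertools import product


def score(beta, x):
    """Linear combination of the individual variables (the combined test)."""
    return sum(b * v for b, v in zip(beta, x))


def empirical_auc(beta, cases, controls):
    """Share of (case, control) pairs ranked correctly; ties count one half."""
    total = 0.0
    for xc, xn in product(cases, controls):
        d = score(beta, xc) - score(beta, xn)
        total += 1.0 if d > 0 else (0.5 if d == 0 else 0.0)
    return total / (len(cases) * len(controls))


def pairwise_hinge(beta, cases, controls):
    """Mean hinge loss max(0, 1 - d) over pairs: a convex upper bound on 1 - AUC."""
    total = 0.0
    for xc, xn in product(cases, controls):
        d = score(beta, xc) - score(beta, xn)
        total += max(0.0, 1.0 - d)
    return total / (len(cases) * len(controls))


def structured_beta(gamma_root, theta_stem):
    """Assumed (hypothetical) reparametrization for one root and one stem:
    the stem coefficient is gamma_root * theta_stem, so shrinking the root
    factor to zero also removes the stem from the model."""
    return [gamma_root, gamma_root * theta_stem]


# Toy data (made up): each row is [root answer, stem answer].
cases = [[1, 1], [1, 0], [0, 0]]
controls = [[0, 0], [1, 0], [0, 0]]

beta = structured_beta(1.0, 2.0)  # root coefficient 1.0, stem coefficient 2.0
auc = empirical_auc(beta, cases, controls)
surrogate = pairwise_hinge(beta, cases, controls)
```

In this sketch a penalty on the root and stem factors would be added to `pairwise_hinge` before minimizing; the penalty form and the optimizer are left out because the abstract does not specify them.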

[1]  P. Zhao,et al.  The composite absolute penalties family for grouped and hierarchical variable selection , 2009, 0909.0411.

[2]  B. Efron How Biased is the Apparent Error Rate of a Prediction Rule , 1986 .

[3]  M. First,et al.  Structured Clinical Interview for DSM-IV Axis I Disorders , 1997 .

[4]  Jian Huang,et al.  Regularized ROC method for disease classification and biomarker selection with microarray data , 2005, Bioinform..

[5]  Nancy A Obuchowski,et al.  An ROC‐type measure of diagnostic accuracy when the gold standard is continuous‐scale , 2006, Statistics in medicine.

[6]  Xiaodong Lin,et al.  Gene selection using support vector machines with non-convex penalty , 2005 .

[7]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[8]  Cun-Hui Zhang,et al.  A group bridge approach for variable selection , 2009, Biometrika.

[9]  Gail Gong Cross-Validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression , 1986 .

[10]  T. Cai,et al.  Combining Predictors for Classification Using the Area under the Receiver Operating Characteristic Curve , 2006, Biometrics.

[11]  Yi Lin,et al.  An Efficient Variable Selection Approach for Analyzing Designed Experiments , 2007, Technometrics.

[12]  M. Weissman,et al.  Psychotic symptoms in an urban general medicine practice. , 2002, The American journal of psychiatry.

[13]  Bin Nan,et al.  Hierarchically penalized Cox regression with grouped variables , 2009 .

[14]  Aaron K. Han Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator , 1987 .

[15]  Changyi Park,et al.  A Bahadur Representation of the Linear Support Vector Machine , 2008, J. Mach. Learn. Res..

[16]  H. Zou,et al.  Structured variable selection and estimation , 2009, 1011.0610.

[17]  M. Yuan,et al.  Model selection and estimation in regression with grouped variables , 2006 .

[18]  M. Pepe The Statistical Evaluation of Medical Tests for Classification and Prediction , 2003 .

[19]  Jianqing Fan,et al.  Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties , 2001 .

[20]  J A Swets,et al.  Measuring the accuracy of diagnostic systems. , 1988, Science.

[21]  Susmita Datta,et al.  Predicting Patient Survival from Microarray Data by Accelerated Failure Time Modeling Using Partial Least Squares and LASSO , 2007, Biometrics.

[22]  Hansong Zhang,et al.  GACV for support vector machines , 2000 .

[23]  Ming Tan,et al.  ROC‐Based Utility Function Maximization for Feature Selection and Classification with Applications to High‐Dimensional Protease Data , 2008, Biometrics.

[24]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[25]  Russell Zaretzki,et al.  The Skill Plot: A Graphical Technique for Evaluating Continuous Diagnostic Tests , 2007, Biometrics.

[26]  Axel Benner,et al.  penalizedSVM: a R-package for feature selection SVM classification , 2009, Bioinform..

[27]  Paul F. Pinsky,et al.  Scaling of True and Apparent ROC AUC with Number of Observations and Number of Variables , 2005 .

[28]  B. Efron The Estimation of Prediction Error , 2004 .

[29]  Szymon Jaroszewicz,et al.  Efficient AUC Optimization for Classification , 2007, PKDD.

[30]  Glenn Fung,et al.  A Feature Selection Newton Method for Support Vector Machine Classification , 2004, Comput. Optim. Appl..

[31]  P. Heagerty,et al.  Survival Model Predictive Accuracy and ROC Curves , 2005, Biometrics.

[32]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[33]  Ulf Brefeld,et al.  AUC maximizing support vector learning , 2005 .

[34]  D. Hand On Briggs and Zaretzki: The Skill Plot: A Graphical Technique for Evaluating Continuous Diagnostic Tests , 2007 .

[35]  P. Bebbington,et al.  Psychosis Screening Questionnaire , 2014 .

[36]  Jian Huang,et al.  Combining Multiple Markers for Classification Using ROC , 2007, Biometrics.

[37]  M S Pepe,et al.  Evaluating technologies for classification and prediction in medicine , 2005, Statistics in medicine.