NCC-AUC: an AUC optimization method to identify multi-biomarker panel for cancer prognosis from genomic and clinical data

MOTIVATION In prognosis and survival studies, an important goal is to identify multi-biomarker panels with predictive power from molecular characteristics or clinical observations. Such analysis is often challenged by genomic profiles or clinical data that are censored, small in sample size, and high-dimensional. Sophisticated models and algorithms are therefore in pressing need.

RESULTS In this study, we propose a novel Area Under Curve (AUC) optimization method for multi-biomarker panel identification, named Nearest Centroid Classifier for AUC optimization (NCC-AUC). Our method is motivated by the connection between the AUC score, used to evaluate classification accuracy, and Harrell's concordance index in survival analysis. This connection allows us to convert the survival time regression problem into a binary classification problem. We then formulate an optimization model that directly maximizes AUC while minimizing the number of selected features, and use it to construct a predictor in the nearest centroid classifier framework. We validate NCC-AUC on genomic data of breast cancer and on clinical data of stage IB Non-Small-Cell Lung Cancer (NSCLC). On the genomic data, NCC-AUC outperforms Support Vector Machine (SVM) and Support Vector Machine-based Recursive Feature Elimination (SVM-RFE) in classification accuracy. It tends to select a multi-biomarker panel with low average redundancy and enriched biological meaning. NCC-AUC also separates low- and high-risk cohorts more significantly than the widely used Cox proportional-hazards regression model (Cox model) and the L1-penalized Cox model (L1-Cox model). These performance gains are robust across 5 subtypes of breast cancer. Furthermore, on an independent clinical dataset, NCC-AUC outperforms SVM and SVM-RFE in predictive accuracy and is consistently better than the Cox model and the L1-Cox model at grouping patients into high- and low-risk categories.
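
The link between AUC and Harrell's concordance index, and the idea of scoring patients against two class centroids on a selected feature subset, can be illustrated with a minimal sketch. This is not the NCC-AUC implementation: the function names, the toy data, the two-level outcome coding, and the squared-distance score are all assumptions made for illustration, and the feature mask is fixed by hand rather than found by the optimization model described above.

```python
import numpy as np


def auc_score(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def harrell_c_index(risk, time, event):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when patient i has an observed event
    strictly before time[j]; it is concordant when risk[i] > risk[j].
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


def nearest_centroid_score(X, labels, mask):
    """Risk score from two class centroids on the features selected by mask.

    Hypothetical scoring rule: squared distance to the good-outcome centroid
    minus squared distance to the poor-outcome centroid, so a higher score
    means the sample lies closer to the poor-outcome centroid.
    """
    X = np.asarray(X, dtype=float)[:, np.asarray(mask, dtype=bool)]
    labels = np.asarray(labels, dtype=bool)
    c_pos = X[labels].mean(axis=0)
    c_neg = X[~labels].mean(axis=0)
    return ((X - c_neg) ** 2).sum(axis=1) - ((X - c_pos) ** 2).sum(axis=1)


# Toy data (invented): poor == True marks the poor-outcome class.
risk = [0.9, 0.2, 0.3, 0.4, 0.1]
poor = [True, True, False, True, False]
# Two-level outcome: poor-outcome patients are coded with the shorter time,
# and every event is treated as observed (no censoring).
time = [0 if p else 1 for p in poor]
event = [True] * len(poor)

print("AUC    :", auc_score(risk, poor))
print("C-index:", harrell_c_index(risk, time, event))
```

With a two-level outcome and no censoring, the comparable pairs of the concordance index are exactly the (poor, good) pairs counted by AUC, so the two numbers coincide on this toy input. With real censored survival times the comparable set differs, and the two measures are related rather than identical.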
CONCLUSION In summary, NCC-AUC provides a rigorous optimization framework to systematically reveal multi-biomarker panels from genomic and clinical data. It can serve as a useful tool for identifying prognostic biomarkers in survival analysis.

AVAILABILITY AND IMPLEMENTATION NCC-AUC is available at http://doc.aporc.org/wiki/NCC-AUC.

CONTACT ywang@amss.ac.cn

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
