Boosting the Concordance Index for Survival Data – A Unified Framework To Derive and Evaluate Biomarker Combinations

The development of molecular signatures for the prediction of time-to-event outcomes is a methodologically challenging task in bioinformatics and biostatistics. Although there are numerous approaches for deriving and evaluating marker combinations, the underlying methodology often mixes different optimization criteria across the feature selection, estimation and evaluation steps. This can result in marker combinations that are suboptimal with respect to the evaluation criterion of interest. To address this issue, we propose a unified framework to derive and evaluate biomarker combinations. Our approach is based on the concordance index for time-to-event data, a non-parametric measure that quantifies the discriminatory power of a prediction rule. Specifically, we propose a gradient boosting algorithm that yields linear biomarker combinations that are optimal with respect to a smoothed version of the concordance index. We investigate the performance of our algorithm in a large-scale simulation study and on two molecular data sets for the prediction of survival in breast cancer patients. Our numerical results show that the new approach is not only methodologically sound but can also lead to higher discriminatory power than traditional approaches for the derivation of gene signatures.
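To make the two quantities named in the abstract concrete, the following is a minimal sketch of a concordance index for right-censored data and of a sigmoid-smoothed counterpart that is differentiable in the scores. It is an illustration under stated assumptions, not the estimator used in the paper: the function names, the pair-selection rule (a pair counts only if the earlier time is an observed event), the convention that a higher score means higher risk, and the smoothing parameter `sigma` are all assumptions, and no censoring weights are applied.

```python
import numpy as np

def comparable_pairs(time, event):
    """Entry (i, j) is True when subject i had an observed event before time[j].

    Assumption: pairs whose earlier time is censored are treated as not comparable.
    """
    return (time[:, None] < time[None, :]) & event[:, None].astype(bool)

def c_index(score, time, event):
    """Share of comparable pairs in which the earlier-failing subject has the higher score.

    Assumption: higher score = higher risk; tied scores count as 0.5.
    """
    pairs = comparable_pairs(time, event)
    diff = score[:, None] - score[None, :]
    concordant = (diff > 0) + 0.5 * (diff == 0)
    return (concordant * pairs).sum() / pairs.sum()

def smoothed_c_index(score, time, event, sigma=0.1):
    """Same ratio, with the indicator replaced by a sigmoid of the score difference.

    The smoothed criterion is differentiable in the scores, which is what a
    gradient-based optimizer needs; sigma (hypothetical default) controls how
    closely the sigmoid approximates the indicator.
    """
    pairs = comparable_pairs(time, event)
    diff = score[:, None] - score[None, :]
    # Clip the exponent to avoid overflow warnings for very small sigma.
    smooth = 1.0 / (1.0 + np.exp(-np.clip(diff / sigma, -50.0, 50.0)))
    return (smooth * pairs).sum() / pairs.sum()

# Toy data: four subjects, the third one censored (event = 0).
time = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 1, 0, 1])
risk = np.array([4.0, 3.0, 2.0, 1.0])  # highest risk fails first

print(c_index(risk, time, event))
print(smoothed_c_index(risk, time, event, sigma=0.1))
```

For a linear biomarker combination, `score` would be a matrix product of the marker matrix and a coefficient vector, so the smoothed criterion becomes a differentiable function of the coefficients. Smaller values of `sigma` track the unsmoothed index more closely but make the gradient steeper around ties.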
