Judging new markers by their ability to improve predictive accuracy.

The man who has recently undergone radical prostatectomy for clinically localized prostate cancer now faces an important decision: whether or not adjuvant therapy would be beneficial. Clearly, a major factor in this decision is the likelihood of his disease recurring in the absence of additional therapy. There are at least three well-documented prognostic models for use in this setting, and each predicts the likelihood of biochemical progression (i.e., prostate-specific antigen [PSA]-defined recurrence of prostate cancer). Partin et al. (1) developed an equation they called “Rw”; Blute et al. (2) devised the “GPSM” score (which includes the Gleason score, PSA level, seminal vesicle status, and margin status); and Kattan et al. (3) derived a postoperative nomogram, which was later validated by Graefen et al. (4).

Which of these models predicts best for the individual patient? The GPSM score and the postoperative nomogram have been evaluated by the concordance index, with values of 0.76 and 0.80, respectively, suggesting relatively similar performance. The concordance index is the probability that, given two randomly selected patients, the patient with the worse outcome is, in fact, predicted to have the worse outcome (5). This measure, similar to an area under the receiver operating characteristic curve, ranges from 0.5 (i.e., chance, a coin flip) to 1.0 (perfect ability to rank patients).

In this issue of the Journal, Rhodes et al. (6) report that a novel marker, E-cadherin and enhancer of zeste homolog 2 (EZH2) status, may provide additional prognostic ability in the postoperative prostate cancer setting. They found that the interaction of E-cadherin and EZH2 is statistically significant in multivariable analysis (P = .003), with a hazard ratio of 3.19. This association may prove to have important biologic implications. However, from a prediction perspective, an important question should be asked of any new marker: How accurate is the best prediction model that contains the new marker relative to the best model that lacks it? That is, how much does the concordance index improve with knowledge of the patient’s novel marker? This increment is a direct gauge of the progress being made in our ability to predict patient outcome.

Analyses that characterize markers by their impact on a model’s predictive accuracy (e.g., as measured by a change in the concordance index) are rare but beneficial. Begg et al. (7) effectively did this when they compared three rival staging systems in thymoma. As Begg et al. point out, many prognostic factors contain little or no relevant information that is not already available when standard prognostic factors are combined optimally. For this reason, it is important to compare the best (i.e., most accurately predicting) models, with and without the marker of interest. When judging the value of a model containing a new marker, the key question is whether an equivalent concordance index can be achieved by optimal modeling of all predictors besides the novel marker. If so, the new marker has not improved our ability to predict patient outcome.
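Because the concordance index is the yardstick for such comparisons, a concrete calculation may help. The sketch below is illustrative only: a pure-NumPy Python implementation of Harrell's concordance index, applied to invented follow-up times, recurrence indicators, and predicted risk scores, that counts the fraction of usable patient pairs in which the patient who fared worse was also assigned the higher predicted risk.

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored outcomes (illustrative).

    A pair of patients is usable when the one with the shorter follow-up time
    actually experienced the event (otherwise we cannot say who fared worse).
    A usable pair is concordant when that patient was also assigned the higher
    predicted risk; ties in predicted risk count as 0.5.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    risk_scores = np.asarray(risk_scores, dtype=float)

    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                continue  # tied follow-up times are skipped in this sketch
            first, second = (i, j) if times[i] < times[j] else (j, i)
            if not events[first]:
                continue  # earlier time was censored, so the pair is not usable
            usable += 1
            if risk_scores[first] > risk_scores[second]:
                concordant += 1.0
            elif risk_scores[first] == risk_scores[second]:
                concordant += 0.5
    return concordant / usable

# Invented data: months to biochemical recurrence, event indicator, model risk score.
times = [12, 30, 24, 60, 45, 18]
events = [1, 0, 1, 0, 1, 1]
risk = [0.9, 0.2, 0.7, 0.1, 0.4, 0.8]
print(concordance_index(times, events, risk))  # 1.0 here; 0.5 would be chance
```

In this toy example the risk scores rank every usable pair correctly, so the index is 1.0; a model that ranked patients no better than a coin flip would hover near 0.5.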
Why should we change the way we ordinarily look at markers and instead compare the accuracies of two models? The reasons that the comparison must be model-based, and that traditional reporting of P values and hazard ratios from multivariable analysis is inadequate, are manifold.

First, an individual patient’s optimal prediction will, in most cases, come from a multivariable model; rarely would a single marker, absent any modeling, be ideal for prediction. If a model of markers provides the most accurate prediction, we should be evaluating models of markers. Second, the P value tests whether the marker’s association with outcome is zero, which is not the question of direct interest: whether a new marker improves our ability to predict. As Simon (8) points out, these are different questions. Third, the P value for a novel marker may depend on how the other variables are handled in the multivariable model; for example, the use of cutoffs or transformations for the established marker(s) can affect the P value of the novel marker. A comparison of the best models with and without the marker of interest provides a more objective alternative, because the emphasis shifts to the predictive accuracies of the models: whatever modeling provides the most accurate predictions (e.g., maximizes the concordance index) should be used, an objective goal.

This model comparison conveniently alleviates another problem, that of automated variable selection. Procedures such as backward elimination tend to reduce the P values of the variables that survive elimination (i.e., the P values of the remaining variables tend to shrink as other variables are eliminated) (9). Thus, the concern that a marker has a small P value only after variable selection, and not when judged in the full model before selection, is largely resolved, because 1) automated variable selection procedures would be used only when they improve a model’s predictive ability, which is very rare (9), and 2) the P value of the marker after variable selection would not be of direct interest.

Interpretation of a novel marker’s hazard ratio, in an effort to judge the marker’s prognostic value, has similar drawbacks: the hazard ratio depends on the measurement scale of the marker, the cutoff(s) used for the novel marker, and the manner in which the established variables are modeled.

The following case study illustrates why incremental model predictive accuracy is a valuable metric. A new marker, the percentage of biopsy cores positive for prostate cancer, was recently analyzed for its ability to improve preoperative prediction of prostate cancer recurrence.
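A with-versus-without comparison of this kind can be carried out directly. The sketch below is a hypothetical illustration rather than the analysis described above: it simulates a small cohort (the column names psa, gleason, and marker, the effect sizes, and the censoring scheme are all invented), uses the lifelines package to fit a Cox model with and without the candidate marker on a training split, and compares the two models’ concordance indexes on held-out patients. In practice, each competing model would first be tuned to predict as accurately as possible.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

# Simulated cohort (all values invented for illustration).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "psa": rng.lognormal(2.0, 0.6, n),       # established predictor
    "gleason": rng.integers(6, 10, n),       # established predictor
    "marker": rng.normal(0.0, 1.0, n),       # candidate novel marker
})
true_risk = 0.03 * df["psa"] + 0.5 * (df["gleason"] - 6) + 0.4 * df["marker"]
df["time"] = rng.exponential(60.0 / np.exp(true_risk))  # months to recurrence
df["event"] = (rng.uniform(size=n) < 0.7).astype(int)   # roughly 30% censored

train, test = df.iloc[:300], df.iloc[300:]

def held_out_cindex(covariates):
    """Fit a Cox model on the training split; return its c-index on the test split."""
    cph = CoxPHFitter()
    cph.fit(train[covariates + ["time", "event"]],
            duration_col="time", event_col="event")
    # Higher partial hazard means higher risk, so negate it for concordance_index.
    return concordance_index(test["time"],
                             -cph.predict_partial_hazard(test[covariates]),
                             test["event"])

c_without = held_out_cindex(["psa", "gleason"])
c_with = held_out_cindex(["psa", "gleason", "marker"])
print(f"c-index without marker: {c_without:.3f}")
print(f"c-index with marker:    {c_with:.3f} (gain: {c_with - c_without:+.3f})")
```

The quantity of interest is the gain in the held-out concordance index, not the marker’s P value or hazard ratio within the larger model.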

[1] B. O'Sullivan, et al. Prognostic factors in cancer. 2003.

[2] F. Harrell, et al. Evaluating the yield of medical tests. JAMA, 1982.

[3] A. M. Chinnaiyan, et al. Multiplex biomarker approach for determining risk of prostate-specific antigen-defined recurrence of prostate cancer. Journal of the National Cancer Institute, 2003.

[4] F. E. Harrell. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. 2001.

[5] F. Marshall, et al. Selection of men at high risk for disease recurrence for experimental adjuvant therapy following radical prostatectomy. Urology, 1995.

[6] D. B. Mark, et al. Tutorial in biostatistics: multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. 1996.

[7] R. L. Sutherland, et al. Validation study of the accuracy of a postoperative nomogram for recurrence after radical prostatectomy for localized prostate cancer. 2002.

[8] M. W. Kattan, et al. Prediction of progression: nomograms of clinical utility. Clinical Prostate Cancer, 2002.

[9] R. Simon. Evaluating prognostic factor studies. 2003.

[10] C. Begg, et al. Comparing tumour staging and grading systems: a case study and a review of the issues, using thymoma as a model. Statistics in Medicine, 2000.

[11] M. Kattan, et al. Postoperative nomogram for disease recurrence after radical prostatectomy for prostate cancer. Journal of Clinical Oncology, 1999.

[12] E. Bergstralh, et al. Use of Gleason score, prostate specific antigen, seminal vesicle and margin status to predict biochemical failure after radical prostatectomy. The Journal of Urology, 2001.