Assessing the Value of Risk Predictions by Using Risk Stratification Tables

Key Summary Points

- Risk prediction models are statistical models used to predict the probability of an outcome on the basis of the values of 1 or more risk factors (markers). The accuracy of the model's predictions is typically summarized with statistics that describe the model's discrimination and calibration.
- Risk stratification tables are a more informative way to assess and compare the models. The tables illustrate the distribution of predictions across risk categories. That illustration allows users to assess 3 key measures of the models' value for guiding medical decisions: the models' calibration, ability to stratify people into clinically relevant risk categories, and accuracy at classifying patients into higher- and lower-risk categories. This information is contained in the margins of the risk stratification table rather than in its cells.
- The tables should be used to compare risk prediction models only when one of the models contains all of the markers that are contained in the other (nested models); they should not be used to compare models with different sets of markers (nonnested models).
- The table predictions require corrections when case-control data are used.

The recent epidemiologic and clinical literature is filled with studies evaluating statistical models that predict risk for disease or some other adverse event (1-5). Because risk prediction models are intended to help patients and clinicians make decisions, evaluating these models requires methods that differ from those used to assess models describing disease etiology: the characteristics of the models matter less than their value for guiding decisions. Cook and colleagues (1, 6) recently proposed a new approach to evaluating risk prediction models: the risk stratification table.
This methodology appropriately focuses on the key purpose of a risk prediction model, which is to classify individuals into clinically relevant risk categories, and it has therefore been widely adopted in the literature (2-4). In this article, we examine the risk stratification approach in detail, identifying the relevant information that can be abstracted from a risk stratification table and cautioning against misuses of the method that frequently occur in practice. We use a recently published study of a breast cancer risk prediction model by Tice and colleagues (2) to illustrate the concepts.

Background

A risk prediction marker is any measure that is used to predict a person's risk for an event. It may be a quantitative measure, such as high-density lipoprotein cholesterol level, or a qualitative measure, such as family history of disease. Risk predictors are also risk factors, in the sense that they will necessarily be strongly associated with the risk for disease. But a large, significant association does not ensure that the marker has value in predicting risk for many people. A risk prediction model is a statistical model that combines information from several markers. Common types include logistic regression models, Cox proportional hazards models, and classification trees. Each type of model produces a predicted risk for each person by using the information in the model. Consider, for example, a model predicting breast cancer risk that includes age as the only predictor. The resulting risk prediction for a woman of a given age is simply the proportion of women her age who develop breast cancer. The woman's predicted risk will change if more information is included in the model. For instance, if family history is added, her predicted risk will be the proportion of women her age and with her family history who develop breast cancer. The purpose of a risk prediction model is to accurately stratify individuals into clinically relevant risk categories.
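The idea that a predicted risk is simply the event proportion among people sharing the same predictor values, and that it changes when a predictor is added, can be made concrete with a small sketch. The miniature cohort below is entirely hypothetical and is not data from the article; it only illustrates the mechanics.

```python
from collections import defaultdict

# Hypothetical records: (age_group, family_history, developed_cancer)
cohort = [
    ("40-49", False, 0), ("40-49", False, 0),
    ("40-49", True, 1),  ("40-49", True, 0),
    ("50-59", False, 1), ("50-59", False, 0),
    ("50-59", True, 1),  ("50-59", True, 1),
]

def predicted_risk(records, *predictor_indices):
    """Empirical predicted risk: the proportion with the event among
    people sharing the same values of the chosen predictors."""
    tally = defaultdict(lambda: [0, 0])  # predictor values -> [events, total]
    for rec in records:
        key = tuple(rec[i] for i in predictor_indices)
        tally[key][0] += rec[-1]  # last field is the event indicator
        tally[key][1] += 1
    return {k: events / total for k, (events, total) in tally.items()}

risk_age = predicted_risk(cohort, 0)        # model with age group only
risk_age_fh = predicted_risk(cohort, 0, 1)  # age group plus family history
```

With age alone, every woman aged 40 to 49 gets the same prediction (1 event among 4 women, or 0.25); once family history is added, the same woman's prediction moves to the proportion within her age and family-history stratum (0.5 with a family history, 0.0 without).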
This risk information can be used to guide clinical or policy decisions: for example, about preventive interventions for persons or disease screening for subpopulations identified as high risk, or to select persons for inclusion in clinical trials. The value of a risk prediction model for guiding these kinds of decisions can be judged by the extent to which the risk calculated from the model reflects the fraction of persons in the population with actual events (its calibration); the proportions in which the population is stratified into clinically relevant risk categories (its stratification capacity); and the extent to which participants with events are assigned to high-risk categories and those without events are assigned to low-risk categories (its classification accuracy).

Risk prediction models are commonly evaluated by using the receiver-operating characteristic (ROC) curve (4, 7), a standard tool for evaluating the discriminatory accuracy of diagnostic or screening markers. This curve shows the true-positive rate plotted against the false-positive rate for rules that classify persons by using risk thresholds that vary over all possible values. Receiver-operating characteristic curves are generally not helpful for evaluating risk prediction models because they do not provide information about the actual risks that the models predict or about the proportion of participants who have high or low risk values. Moreover, when ROC curves for 2 risk prediction models are compared, the models are aligned according to their false-positive rates (that is, different risk thresholds are applied to the 2 models to achieve the same false-positive rate), which is clearly inappropriate. In addition, the area under the ROC curve, or c-statistic, a commonly reported summary measure that can be interpreted as the probability that the predicted risk for a participant with an event is higher than that for a participant without an event, has little direct clinical relevance.
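The c-statistic can be computed directly from its probabilistic definition, which makes its limitation easy to see: it depends only on the ranking of predicted risks, not on their actual values. This sketch uses hypothetical risk values, not data from the article.

```python
def c_statistic(risks, events):
    """Probability that a randomly chosen participant with an event has a
    higher predicted risk than a randomly chosen participant without one
    (ties count as one half)."""
    case_risks = [r for r, e in zip(risks, events) if e]
    control_risks = [r for r, e in zip(risks, events) if not e]
    concordant = sum(
        1.0 if rc > rn else 0.5 if rc == rn else 0.0
        for rc in case_risks
        for rn in control_risks
    )
    return concordant / (len(case_risks) * len(control_risks))

# Perfect rank separation of events from non-events gives 1.0, regardless
# of whether the risks themselves are 0.3 and 0.4 or 0.003 and 0.004 --
# the statistic says nothing about whether any risk is clinically high.
auc = c_statistic([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1])  # 1.0
```

Rescaling all the risks leaves the c-statistic unchanged, which is precisely why it cannot speak to calibration or to how the population is distributed across clinically relevant risk categories.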
Clinicians are never asked to compare risks for a pair of patients: one who will eventually have the event and one who will not. Neither the ROC curve nor the c-statistic relates to the practical task of predicting risks for clinical decision making.

Cook and colleagues (1, 6) propose using risk stratification tables to evaluate the incremental value of a new marker, that is, the benefit of adding a new marker (for example, C-reactive protein) to an established set of risk predictors (for example, Framingham risk predictors, such as age, diabetes, cholesterol level, smoking, and low-density lipoprotein cholesterol levels). In these stratification tables, risks calculated from models with and without the new marker are cross-tabulated. This approach represents a substantial improvement over the use of ROC methodology because it displays the risks calculated from the model and the proportions of individuals in the population who are stratified into the risk groups. We will provide an example of this approach and show how information about model calibration, stratification capacity, and classification accuracy can be derived from a risk stratification table and used to assess the added value of a marker for clinical and health care policy decisions.

Example

Tice and colleagues (2) published a study that builds and evaluates a model for predicting breast cancer risk by using data from 1,095,484 women in a prospective cohort and incidence data from the Surveillance, Epidemiology, and End Results database. Age, race or ethnicity, family history, and history of breast biopsy were used to model risk with a Cox proportional hazards model. The study focused on the benefit of adding breast density information to the model. The hazard ratio for breast density in the multivariate model (extremely dense vs. almost entirely fat) was estimated as 4.2 for women younger than age 65 years and 2.2 for women age 65 years or older.
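The cross-tabulation at the heart of a risk stratification table can be sketched in a few lines. The category cutoffs below are the 5-year-risk categories used by Tice and colleagues; the predicted risks themselves are hypothetical, and the handling of exact category boundaries is illustrative rather than taken from the article.

```python
CUTOFFS = (0.01, 0.0167, 0.025)  # <1%, 1%-1.66%, 1.67%-2.5%, >2.5%

def risk_category(risk):
    """Map a predicted 5-year risk to an ordered category index 0-3."""
    for i, cutoff in enumerate(CUTOFFS):
        if risk < cutoff:
            return i
    return len(CUTOFFS)

def stratification_table(risks_without, risks_with):
    """Cross-tabulate categories from the model without the new marker
    (rows) against categories from the model with it (columns)."""
    n = len(CUTOFFS) + 1
    table = [[0] * n for _ in range(n)]
    for r_old, r_new in zip(risks_without, risks_with):
        table[risk_category(r_old)][risk_category(r_new)] += 1
    return table

# Hypothetical predictions for 5 women, before and after adding a marker
without_marker = [0.005, 0.012, 0.012, 0.020, 0.030]
with_marker    = [0.004, 0.009, 0.018, 0.020, 0.040]
table = stratification_table(without_marker, with_marker)
```

Off-diagonal cells count people whom the new marker reclassifies (here, one woman moves down a category and one moves up), while the row and column totals, the margins, carry the calibration, stratification, and classification information discussed below.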
This suggests that breast density is strongly associated with disease risk; that is, breast cancer rates are higher among women with higher breast density. However, it does not describe the value of breast density for helping women make informed clinical decisions, which requires knowledge of the frequency distribution of breast density in the population. To evaluate the added value of breast density, Tice and colleagues defined 5-year breast cancer risk categories as low (<1%), low to intermediate (1% to 1.66%), intermediate to high (1.67% to 2.5%), and high (>2.5%). The 1.67% cutoff for intermediate risk was presumably chosen on the basis of recommendations by the American Society of Clinical Oncology (8) and the Canadian Task Force on Preventive Health Care (9) to counsel women with 5-year risks greater than this threshold about considering tamoxifen for breast cancer prevention. Tice and colleagues used a risk stratification table (Table 1) to compare risk prediction models with and without breast density.

Table 1. Five-Year Risks for Breast Cancer as Predicted by Models That Do and Do Not Include Breast Density

Calibration

Assessing model calibration is an important first step in evaluating any risk prediction model. Good calibration is essential; it means that the model-predicted probability of an event for a person with specified predictor values is the same as, or very close to, the proportion of all persons in the population with those same predictor values who experience the event (10). With many predictors, and especially with continuous predictors, we cannot evaluate calibration at each possible predictor value because there are too few participants with exactly those values. Instead, the standard approach is to place persons within categories of predicted risk and to compare the category values with the observed event rates for participants in each category.
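The standard category-based calibration check described above can be sketched directly: bin participants by predicted risk and compare the observed event proportion in each bin with the predicted risks. The bins below use the cutoffs from Tice and colleagues; the participant data are hypothetical, and with such tiny counts the observed rates are only meant to show the comparison, not realistic 5-year risks.

```python
from collections import defaultdict

CUTOFFS = (0.01, 0.0167, 0.025)  # category boundaries used by Tice and colleagues

def category(risk):
    """Ordered category index 0-3 (boundary handling is illustrative)."""
    return sum(risk >= c for c in CUTOFFS)

def observed_rates(predicted_risks, events):
    """Observed event proportion within each predicted-risk category.
    Good calibration means these agree with the risks predicted there."""
    tally = defaultdict(lambda: [0, 0])  # category -> [events, total]
    for risk, event in zip(predicted_risks, events):
        cat = category(risk)
        tally[cat][0] += event
        tally[cat][1] += 1
    return {cat: ev / n for cat, (ev, n) in sorted(tally.items())}

# Hypothetical predicted risks and observed outcomes for 6 women
rates = observed_rates([0.005, 0.005, 0.02, 0.02, 0.03, 0.03],
                       [0, 0, 0, 1, 1, 1])
```

In a real evaluation, each category's observed rate would be compared with the range of risks the model predicts for that category, which is exactly the comparison made between the margins of a risk stratification table and its row and column labels.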
The calibration of the risk prediction models for breast cancer can be assessed by comparing the proportions of events in the margins of Table 1 with the corresponding row and column labels. For the model without breast density, the proportions of observed events within each risk category are in the far-right Total column and they generally agree wit

[1] K. Kerlikowske, et al. Using Clinical Factors and Mammographic Breast Density to Estimate Breast Cancer Risk: Development and Validation of a New Predictive Model. Annals of Internal Medicine, 2008.

[2] J. Kassirer, et al. Therapeutic decision making: a cost-benefit analysis. The New England Journal of Medicine, 1975.

[3] Nancy R. Cook, et al. The Effect of Including C-Reactive Protein in Cardiovascular Risk Prediction Models for Women. Annals of Internal Medicine, 2006.

[4] B. Langholz, et al. Estimation of absolute risk from nested case-control data. Biometrics, 1997.

[5] Frank E. Harrell. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer, 2001.

[6] Margaret T. May. Regression Modelling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis, by Frank E. Harrell Jr (book review). 2002.

[7] Yingye Zheng, et al. Integrating the predictiveness of a marker with its performance as a classifier. American Journal of Epidemiology, 2007.

[8] J. Copas. Regression, Prediction and Shrinkage. 1983.

[9] M. Pencina, et al. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine, 2008.

[10] J. Ware, et al. Comments on 'Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond' by M. J. Pencina et al. Statistics in Medicine, 2008.

[11] R. Pyke, et al. Logistic disease incidence models and case-control studies. 1979.

[12] J. Benichou, et al. Methods of inference for estimates of absolute risk derived from population-based case-control studies. Biometrics, 1995.

[13] C. Chatfield. Model uncertainty, data mining and statistical inference. 1995.

[14] M. Levine, et al. Chemoprevention of breast cancer. A joint guideline from the Canadian Task Force on Preventive Health Care and the Canadian Breast Cancer Initiative's Steering Committee on Clinical Practice Guidelines for the Care and Treatment of Breast Cancer. CMAJ, 2001.

[15] D. Levy, et al. A Risk Score for Predicting Near-Term Incidence of Hypertension: The Framingham Heart Study. Annals of Internal Medicine, 2008.

[16] David W. Hosmer, et al. Applied Logistic Regression. 1991.

[17] M. Somerfield, et al. American Society of Clinical Oncology technology assessment on breast cancer risk reduction strategies: tamoxifen and raloxifene. Journal of Clinical Oncology, 1999.

[18] M. S. Pepe, et al. Semiparametric methods for evaluating the covariate-specific predictiveness of continuous markers in matched case-control studies. Journal of the Royal Statistical Society, Series C (Applied Statistics), 2010.

[19] M. S. Pepe, et al. A Parametric ROC Model-Based Approach for Evaluating the Predictiveness of Continuous Markers in Case-Control Studies. Working paper.

[20] Ziding Feng, et al. Evaluating the Predictiveness of a Continuous Marker. Biometrics, 2007.

[21] M. Kattan, et al. An Externally Validated Model for Predicting Long-Term Survival after Exercise Treadmill Testing in Patients with Suspected Coronary Artery Disease and a Normal Electrocardiogram. Annals of Internal Medicine, 2007.

[22] A. Zwinderman, et al. Role of the Apolipoprotein B/Apolipoprotein A-I Ratio in Cardiovascular Risk Assessment: A Case-Control Analysis in EPIC-Norfolk. Annals of Internal Medicine, 2007.

[23] Daniel B. Mark, et al. Tutorial in Biostatistics: Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, 1996.

[24] D. Levy, et al. Multiple biomarkers for the prediction of first major cardiovascular events and death. The New England Journal of Medicine, 2006.

[25] F. Harrell, et al. Prognostic/Clinical Prediction Models: Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. 2005.

[26] N. Cook. Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation, 2007.

[27] M. Pencina, et al. Algorithms for assessing cardiovascular risk in women. JAMA, 2007.

[28] N. Cook. Statistical evaluation of prognostic versus diagnostic models: beyond the ROC curve. Clinical Chemistry, 2008.