Assessing the Generalizability of Prognostic Information

Your secretary just called to say that a favorite patient of yours (a 45-year-old high school teacher) with colon cancer dropped off his surgical and pathologic reports. She reminds you that he is scheduled this afternoon for a second opinion about his prognosis. Scanning the reports, you note that the surgeon staged the cancer at Dukes stage C1 on the basis of negative margins and 1 positive lymph node out of 28 and that the pathologist staged the microscopic tissue at Jass stage IV. You are not sure how to translate these stages into useful prognostic information for your patient. Time is short. You log on to MEDLINE, type Dukes and Jass, and select the years 1966 to 1997. The combined terms generate 18 references (1-18). Of these, 4 are independent reports of observed mortality rates (8, 11, 17, 19). You also track down the original reports of the Dukes and Jass systems and learn that both were developed at St. Mark's Hospital in London more than 30 years ago (20, 21). The Dukes system is based on histology and the extent of local, lymphatic, and venous spread seen in the surgical specimen (20), and the Jass system is based on microscopic pathologic staging (21). However, the reported mortality rates by stage for these systems vary widely. How do you tell which report and which system are most likely to pertain to a 45-year-old teacher from Cleveland, Ohio?

Physicians are frequently asked for prognostic assessments and often worry that their assessments will prove inaccurate (22, 23). Prognostic systems, including risk factors, staging systems, decision rules, statistical models, and computer algorithms, have been developed to standardize and enhance the accuracy of prognostic assessments (24, 25). Although diverse techniques are used to develop these systems, all use a sample of patients for whom the outcome is known to relate baseline characteristics to an outcome of interest. Once a system is developed, it can be used to generate predictions for patients whose outcome is not yet known.

A common problem in the application of prognostic systems is that the accuracy of the predictions degrades from the sample in which the system was first developed to subsequent applications; that is, the systems do not generalize (26). Although much has been written on the evaluation and reporting of prognostic systems (25, 27-40), few investigations have directly addressed the issue of generalizability, also known as external validity (29). Instead, discussion has focused on issues of internal validity, such as the sample in which the system was developed, the variables used in the model, the techniques used in system development, or the accuracy of the system in the sample in which it was developed. These factors offer important insights into the probable generalizability of the system to a new sample of patients, but they do not directly test subsequent performance (25).

We discuss the importance of systematically testing the subsequent performance of a system. We begin by defining the relation between accuracy and generalizability, the components of accuracy (calibration and discrimination), and the components of generalizability (reproducibility and transportability). We then discuss issues of transportability and propose a five-level hierarchy of external validity based on the type and degree of transportability tested. We illustrate this approach with a structured review of the Dukes and Jass staging systems for colon and rectal cancer as applied to the 45-year-old teacher described previously. Because this approach treats prognostic system development as a black box and focuses on subsequent performance, it can be applied to any prognostic system, no matter how complex.

Accuracy and Generalizability

Accuracy and generalizability are related concepts (Table 1). Accuracy is the degree to which predictions match outcomes. Generalizability is the ability of the system to provide accurate predictions in a different sample of patients.

Table 1. Definitions of Accuracy and Generalizability

Components of Accuracy

A series of numeric predictions may be inaccurate in two ways. The predicted probability may be too high or too low (an error in calibration), or the relative ranking of individual risk may be out of order (an error in discrimination). Assume that we have observed a sample of patients with colon cancer for whom the overall 5-year mortality rate was 50%. A system that predicted a 50% probability of death at 5 years for each patient would be perfectly calibrated. However, it would not discriminate between patients who lived and those who died within the interval. Conversely, a system that assigned a 10% probability of death at 5 years to patients who lived and an 11% probability to those who died would be perfectly discriminating but poorly calibrated.

The relative importance of calibration and discrimination depends on the intended application. If predictions are used to counsel a patient, the accuracy of the numeric probability (calibration) is important. Patients are not concerned about how sick they are relative to other patients with the disease; instead, they are concerned with the likelihood that their disease will result in death or some other important outcome (such as, in the case of our hypothetical patient, the inability to handle the challenges of teaching) within a defined period of time. Calibration is also important in health services research. When predicted and observed mortality rates are compared to identify unexpectedly high or low rates, errors in calibration can make large numbers of hospitals or providers appear to have excessively high or low mortality rates when, in fact, the model is poorly calibrated (41).
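A minimal sketch can make the distinction concrete. The following pure-Python example (illustrative only; the helper names `mean` and `auc` are invented here, not part of any cited method) reproduces the hypothetical cohort above: system A is well calibrated but nondiscriminating, and system B is perfectly discriminating but poorly calibrated.

```python
# Hypothetical cohort from the text: 100 patients, 50% 5-year mortality.
# Outcome 1 = died within 5 years, 0 = survived.

def mean(xs):
    return sum(xs) / len(xs)

def auc(preds, outcomes):
    """Area under the ROC curve, computed directly from all pairs in
    which one patient died and one survived: the fraction of pairs in
    which the patient who died received the higher prediction
    (ties count one half)."""
    died = [p for p, y in zip(preds, outcomes) if y == 1]
    lived = [p for p, y in zip(preds, outcomes) if y == 0]
    concordant = sum(1.0 if d > l else 0.5 if d == l else 0.0
                     for d in died for l in lived)
    return concordant / (len(died) * len(lived))

outcomes = [1] * 50 + [0] * 50

# System A: predicts 50% for everyone. The mean prediction matches the
# observed rate (well calibrated overall), but AUC = 0.5 (no
# discrimination).
a = [0.50] * 100

# System B: 11% for those who die, 10% for those who live. AUC = 1.0
# (perfect discrimination), but every prediction is far too low
# (poorly calibrated).
b = [0.11] * 50 + [0.10] * 50

print(mean(a), auc(a, outcomes))   # 0.5 0.5
print(mean(b), auc(b, outcomes))   # ~0.105 1.0
```

Neither summary alone suffices: a calibration curve would expose system B's error, and the ROC area exposes system A's.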
In contrast, if predictions are used to stratify patients by stage of severity in order to compare treatments within a given stage, the important aspect of accuracy is whether patients whose disease is within a stage are equally likely to experience the outcome and whether the stages are correctly ranked in order of risk (discrimination) (41).

Calibration and discrimination are evaluated in different ways. Calibration is not routinely measured but can be illustrated by using calibration curves, which plot predicted versus observed outcomes (42). Discrimination is commonly measured by using the area under the receiver-operating characteristic (ROC) curve (43). This area ranges from 0.5 (no discrimination) to 1.0 (perfect discrimination) and reflects the probability that, across all possible pairs of patients in which one patient lives and one dies, the patient who died was assigned the higher risk. The area under the ROC curve can be calculated directly from a table of observed and predicted outcomes (44). It can also be calculated for continuous prognostic estimates (with or without censored observations) by using the C-statistic (37). Several sources provide more thorough discussions of measures of calibration and discrimination (37, 41-43, 45-49).

Components of Generalizability

No matter how calibrated and discriminating a system may be in development, a system that can predict outcomes only in the sample in which it was developed is useless (25, 50). For a system to be generalizable, its accuracy (that is, its calibration and discrimination) must be both reproducible and transportable.

Reproducibility

Reproducibility requires the system to replicate its accuracy in patients who were not included in the development of the system but who are from the same underlying population. A test of the reproducibility of a system evaluates the degree to which the system is fit to real patterns in the data rather than to random noise.
A system is more likely to be fit to random noise (overfit) when the number of patients experiencing events is small relative to the number of variables (37). If a prognostic system is overfit, it may not generalize well. Methods for evaluating the reproducibility of a prognostic system have been thoroughly described elsewhere (32, 35, 51, 52) and are based on the use of data-resampling techniques (such as bootstrapping) to evaluate the degree of overfitting. Bootstrapping techniques can evaluate errors in both discrimination and calibration and are particularly important when the sample used to develop the model is small (35, 52).

Transportability

Alternatively, a system may be reproducible (that is, perform well when tested by bootstrapping) yet degrade in subsequent patient samples because of underfitting (37, 53). Underfitting occurs when important independent predictors of outcome are omitted from the system. For example, a system for breast cancer that omits the presence of metastasis may perform well in a sample of patients with no metastatic disease and degrade badly when it is tested in a more diverse sample. Bootstrapping of the development sample would not detect this problem because the development sample is homogeneous with respect to metastatic disease. Omission of metastasis in breast cancer staging is, of course, an obvious mistake; however, not all important prognostic variables in a disease state are known. Because any given sample may be homogeneous for an important variable (known or unknown), it is important to test the system in subsequent samples.

Transportability requires the system to produce accurate predictions in a sample drawn from a different but plausibly related population or in data collected by using slightly different methods than those used in the development sample. In the case of our patient, we want to know whether staging systems developed at St. Mark's Hospital in London more than 30 years ago pertain to a 45-year-old teacher from Cleveland whose disease was staged at Case Western Reserve University Hospital in 1998. Systems that are underfit may demonstrate reproducibility but not consistent transportability (50). Therefore, data from a separate, nonidentical sample are needed to test the transportability of a prognostic system.
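The bootstrap check for overfitting sketched earlier can be made concrete. In this illustrative pure-Python example (the `develop` step and all names are invented for the sketch, not taken from the cited methods), a "system" is developed by choosing, from ten pure-noise predictors, the one with the best apparent AUC; a bootstrap optimism correction then shows that the apparent accuracy does not reproduce.

```python
import random

random.seed(0)

def auc(preds, outcomes):
    """Area under the ROC curve from all died/survived pairs."""
    died = [p for p, y in zip(preds, outcomes) if y == 1]
    lived = [p for p, y in zip(preds, outcomes) if y == 0]
    c = sum(1.0 if d > l else 0.5 if d == l else 0.0
            for d in died for l in lived)
    return c / (len(died) * len(lived))

def develop(xs, ys):
    """'Develop' a prognostic system by picking the single predictor
    with the best apparent AUC -- a rule guaranteed to fit some noise."""
    k = len(xs[0])
    return max(range(k), key=lambda j: auc([row[j] for row in xs], ys))

# Purely random data: none of the 10 predictors carries real signal.
n, k = 60, 10
xs = [[random.random() for _ in range(k)] for _ in range(n)]
ys = [random.randint(0, 1) for _ in range(n)]

best = develop(xs, ys)
apparent = auc([row[best] for row in xs], ys)

# Bootstrap estimate of optimism: redevelop the system on each
# resample, then compare its AUC on the resample with its AUC on the
# original sample. The mean difference estimates how much the
# apparent AUC overstates true performance.
B = 100
optimism = 0.0
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]
    bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
    j = develop(bx, by)
    optimism += auc([r[j] for r in bx], by) - auc([r[j] for r in xs], ys)
optimism /= B

corrected = apparent - optimism
print(apparent, corrected)  # corrected estimate falls back toward 0.5
```

Note that the entire development process, including the predictor selection, is repeated inside each bootstrap sample; repeating only the final fit would understate the optimism.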

References

[1] R. Conant, et al. Reproducibility of predictor variables from a validated clinical rule. Medical Decision Making, 1992.

[2] J. Wyatt, et al. Nervous about artificial neural networks? The Lancet, 1995.

[3] M. Tötsch, et al. Silver stained nucleolar organizer region proteins (Ag-NORs) as a predictor of prognosis in colonic cancer. The Journal of Pathology, 1990.

[4] T. Smith, et al. Telling the truth about terminal cancer. JAMA, 1998.

[5] E. Fulcheri, et al. [The prognostic value of Jass' histopathological classification of cancer of the left colon and rectum]. Minerva Chirurgica, 1990.

[6] R. Cohen, et al. Assessment of invasive growth pattern and lymphocytic infiltration in colorectal cancer. Histopathology, 1996.

[7] J. Jass, et al. A new prognostic classification of rectal cancer. The Lancet, 1987.

[8] R. Dales, et al. Framing effects on expectations, decisions, and side effects experienced: the case of influenza immunization. Journal of Clinical Epidemiology, 1996.

[9] R. M. Centor, et al. A VisiCalc program for estimating the area under a receiver operating characteristic (ROC) curve. Medical Decision Making, 1985.

[10] L. Hyde, et al. MEDISGRPS: a clinically based approach to classifying hospital patients at admission. Inquiry, 1985.

[11] G. Brier, et al. External correspondence: decompositions of the mean probability score. 1982.

[12] J. Habbema, et al. The measurement of performance in probabilistic diagnosis. III. Methods based on continuous functions of the diagnostic probabilities. Methods of Information in Medicine, 1978.

[13] R. Simon. Why predictive indexes perform less well in validation studies. Is it magic or methods? Archives of Internal Medicine, 1987.

[14] D. Wagner, et al. Daily prognostic estimates for critically ill adults in intensive care units: results from a prospective, multicenter, inception cohort analysis. Critical Care Medicine, 1994.

[15] W. Baxt. Application of artificial neural networks to clinical medicine. The Lancet, 1995.

[16] F. Alemi, et al. Predicting in-hospital survival of myocardial infarction: a comparative study of various severity measures. Medical Care, 1990.

[17] T. Brewin. Three ways of giving bad news. The Lancet, 1991.

[18] G. Lapertosa, et al. Chromogranin-A expression in neoplastic neuroendocrine cells and prognosis in colorectal cancer. Tumori, 1996.

[19] J. A. Swets. Measuring the accuracy of diagnostic systems. Science, 1988.

[20] G. H. Guyatt, et al. Users' guides to the medical literature: V. How to use an article about prognosis. 1994.

[21] C. McConkey, et al. Cyclin/proliferation cell nuclear antigen immunohistochemistry does not improve the prognostic power of Dukes' or Jass' classifications for colorectal cancer. The British Journal of Surgery, 1995.

[22] L. I. Iezzoni, et al. A clinical assessment of MedisGroups. JAMA, 1988.

[23] R. Deyo, et al. Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. Journal of Clinical Epidemiology, 1992.

[24] A. Laupacis, et al. Clinical prediction rules. A review and suggested modifications of methodological standards. JAMA, 1997.

[25] S. J. Senn. Covariate imbalance and random allocation in clinical trials. Statistics in Medicine, 1989.

[26] N. Christakis, et al. Attitude and self-reported practice regarding prognostication in a national sample of internists. Archives of Internal Medicine, 1998.

[27] C. M. Jacobs, et al. MEDISGRPS: a clinically based approach to classifying hospital patients at admission. 1985.

[28] L. Påhlman, et al. Can mortality from rectal and rectosigmoid carcinoma be predicted from histopathological variables in the diagnostic biopsy? APMIS, 1989.

[29] A. Phillips, et al. Clinical staging system for AIDS patients. The Lancet, 1995.

[30] J. Ptacek, et al. Breaking bad news: a review of the literature. 1996.

[31] F. Davidoff, et al. Predicting clinical states in individual patients. Annals of Internal Medicine, 1996.

[32] F. Vecchio. The pathologist's role in the diagnosis and therapy of rectal cancer. Rays, 1995.

[33] D. B. Mark, et al. Tutorial in biostatistics. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. 1996.

[34] H. L. Smith, et al. The role of functional status in predicting inpatient mortality with AIDS: a comparison with current predictors. Journal of Clinical Epidemiology, 1996.

[35] I. Talbot, et al. New grade-related prognostic variable for rectal cancer. The British Journal of Surgery, 1995.

[36] A. Feinstein, et al. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. The New England Journal of Medicine, 1978.

[37] F. Harrell, et al. Regression modelling strategies for improved prognostic prediction. Statistics in Medicine, 1984.

[38] A. Hart, et al. Evaluating black-boxes as medical decision aids: issues arising from a study of neural networks. Medical Informatics, 1990.

[39] J. Merz, et al. How the manner of presentation of data influences older patients in determining their treatment preferences. Journal of the American Geriatrics Society, 1993.

[40] R. Buckman. How to break bad news: a guide for health care professionals. 1992.

[41] P. Tugwell, et al. Users' guides to the medical literature. V. How to use an article about prognosis. Evidence-Based Medicine Working Group. JAMA, 1994.

[42] R. M. Centor, et al. Evaluating physicians' probabilistic judgments. Medical Decision Making, 1988.

[43] E. Fulcheri, et al. Prognostic value of the Jass histopathologic classification in left colon and rectal cancer: a multivariate analysis. Digestion, 1990.

[44] A. O'Connor, et al. Effects of framing and level of probability on patients' preferences for cancer chemotherapy. Journal of Clinical Epidemiology, 1989.

[45] J. W. Tukey, et al. Data Analysis and Regression: A Second Course in Statistics. 1977.

[46] E. Draper, et al. APACHE II: a severity of disease classification system. Critical Care Medicine, 1985.

[47] J. R. Clarke, et al. Revised estimates of diagnostic test sensitivity and specificity in suspected biliary tract disease. Archives of Internal Medicine, 1994.

[48] R. vander Zwaag, et al. From Dukes through Jass: pathological prognostic indicators in rectal cancer. Human Pathology, 1994.

[49] N. Shepherd, et al. p53 and Rb1 protein expression: are they prognostically useful in colorectal cancer? British Journal of Cancer, 1997.

[50] M. Tötsch, et al. Immunohistochemically detectable bcl-2 expression in colorectal carcinoma: correlation with tumour stage and patient survival. British Journal of Cancer, 1995.

[51] A. Mocroft, et al. Staging system for clinical AIDS patients. The Lancet, 1995.

[52] A. R. Feinstein. XIV. The purposes of prognostic stratification. 1972.

[53] A. Feinstein, et al. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Annals of Internal Medicine, 1992.

[54] M. A. Judson. Clinical judgment. Annals of Internal Medicine, 1994.

[55] M. Graffar. [Modern epidemiology]. Bruxelles Médical, 1971.

[56] G. A. Diamond. What price perfection? Calibration and discrimination of clinical prediction models. Journal of Clinical Epidemiology, 1992.

[57] R. Holcombe. Informed consent, cancer, and truth in prognosis. The New England Journal of Medicine, 1994.

[58] J. Hanley, et al. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.

[59] T. G. Parks, et al. Jass' classification revisited. Journal of the American College of Surgeons, 1994.

[60] C. E. Dukes, et al. The spread of rectal cancer and its effect on prognosis. British Journal of Cancer, 1958.

[61] J. Habbema, et al. The measurement of performance in probabilistic diagnosis. II. Trustworthiness of the exact values of the diagnostic probabilities. Methods of Information in Medicine, 1978.

[62] A. Ehrenberg, et al. The design of replicated studies. 1993.

[63] A. Feinstein. Clinical biostatistics. XIV. The purposes of prognostic stratification. Clinical Pharmacology and Therapeutics, 1972.

[64] W. H. Barker, et al. Clinical Epidemiology: The Essentials. 1984.

[65] P. Peduzzi, et al. Importance of events per independent variable in proportional hazards analysis. I. Background, goals, and general strategy. Journal of Clinical Epidemiology, 1995.

[66] I. Talbot, et al. Matrix metalloprotease 2 (MMP-2) and matrix metalloprotease 9 (MMP-9) type IV collagenases in colorectal cancer. Cancer Research, 1996.

[67] M. Caruso, et al. Carcinoma and synchronous hyperplastic polyps of the large bowel. Pathologica, 1994.

[68] N. Christakis. Prognostication and death in medical thought and practice. 1995.

[69] A. R. Feinstein, et al. A new prognostic staging system for the acquired immunodeficiency syndrome. The New England Journal of Medicine, 1989.

[70] B. Fisher, et al. Relative prognostic value of the Dukes and the Jass systems in rectal cancer. Diseases of the Colon and Rectum, 1989.

[71] V. Stone, et al. The relation between hospital experience and mortality for patients with AIDS. JAMA, 1992.

[72] C. Mackenzie, et al. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. Journal of Chronic Diseases, 1987.

[73] H. Sox, et al. Clinical prediction rules. Applications and methodological standards. The New England Journal of Medicine, 1985.