Identifying Adverse Events Caused by Medical Care: Degree of Physician Agreement in a Retrospective Chart Review

Retrospective case review has long been a mainstay of peer review. It supports scientific studies and medical audits as well as assessments of the appropriateness, effectiveness, and quality of health care provided by physicians, hospitals, or regions. As part of quality assurance, hospitals and clinics regularly use formal and informal case review, and insurers and managed care organizations rely on it when making decisions about coverage. All forms of case review depend heavily on expert opinion.

Case review also underlies current and proposed systems of compensating patients for injuries caused by medical care. Under the current litigation system, the patient must prove, with the support of expert medical opinion, that medical care contributed to the injury (causation) and fell below the standards of practice in the community (negligence). Under proposed no-fault alternatives to litigation, entitlement to compensation and liability for payment might also depend on an expert's opinion as to whether the patient's outcome was caused by medical care rather than by a preexisting disease or condition [1-4].

Critics have identified several problems with case review. First, experts cannot form a consensus about which outcomes are adverse. Second, medical technology changes rapidly, creating uncertainty about the appropriateness and effectiveness of practices. Third, the administrative or transaction costs of making individualized determinations of causation might be high [5-8]. The American College of Physicians [9] and others [10] have called for further demonstration projects.

Building on previous research on the reliability of clinical judgments, we used a large sample of physician reviews of medical records to estimate the degree of agreement on the cause of adverse patient outcomes. We also discuss the implications of the results for quality assurance, performance assessment, and proposals for no-fault patient compensation.
Methods

Cases were obtained from the Medical Practice Study, a project designed to estimate the rate of adverse events occurring among inpatients in a random sample of 31 429 medical records from 51 health care facilities in New York State. We defined an adverse event as an injury that 1) was caused at least in part by medical management and 2) required or prolonged hospitalization or led to disability after discharge. The injury could result from a provider's action or inaction in either inpatient or outpatient settings or from a drug or medical device. The medical management did not have to be substandard or inappropriate; the injury could follow an unexpected complication. Adverse outcomes caused solely by underlying disease or by the intended consequences of treatment were not considered to be adverse events. For example, an injury to the recurrent laryngeal nerve during partial thyroidectomy (an unplanned and unintended but recognized complication) would be considered an adverse event, but the intentional destruction of the same nerve in a radical thyroid resection for cancer would not. A broken experimental balloon that led to an embolus and stroke during cardiac catheterization would also be an adverse event, as a complication of treatment, especially if the patient's risk was unknown; this would apply even in a study approved by a Human Subjects Committee. Other aspects of the Medical Practice Study and its general methods have been widely reported [11-15]. The following methods are relevant to our report.

Record Review

Records were reviewed in two stages. In stage 1 (which is not the subject of this report), nurses and medical records administrators used a single review per case to screen the entire sample of records for the presence of 1 or more of 18 explicit criteria (Figure 2). These criteria were based primarily on previous research [16] and were revised by the physician investigators of the Medical Practice Study.
Although explicit, the criteria were broad and open to interpretation. The nurses and records administrators received an extensive manual, which contained detailed examples of the criteria, and 2 hours of focused classroom training from team leaders chosen for this project. To increase the efficiency and accuracy of screening, they used preprinted forms generated by the project management team. Nurses were instructed to refer any questionable cases for stage 2 review. Questions of a more general nature were referred to supervisors and then to the project office for consistent responses. The estimated negative predictive value of the screening was 99.5% [17].

Figure 2. Screening criteria implemented at stage 1 review by nurses and medical records administrators.

[Figure caption] Judgments on adverse events by pairs of physician reviewers and rate of agreement on occurrence of adverse events compared with extreme disagreement. If a = cases of extreme disagreement (one reviewer scored the outcome as 0 [no possible adverse event] and the other scored the case as 4, 5, or 6) and b = cases for which both reviewers found adverse events (both scored the case as 4, 5, or 6), then the reported rate of agreement = b/(a + b). Bars represent exact binomial 95% CIs. Numbers in parentheses are the population-weighted estimates of the number of cases in New York State in 1984 that are represented by the sampled cases reported in this figure.

In stage 2, each record that had or may have had at least one criterion present was analyzed further by two physicians who worked independently. Physicians were recruited primarily from New York State through a network of personal contacts of the study investigators. The physicians could not review records at the hospitals in which they practiced.
Most were board certified in surgery (23%) or internal medicine (68%); the remainder were certified in obstetrics and gynecology, family practice, pediatrics, urology, or emergency medicine. Eighty-five percent were male. Most physicians were in the early stages of their careers: fifty-five percent had received board certification within the 10 years before the study began. All physicians had telephone access to a panel of experts. A separate manual and a structured abstraction form guided the stage 2 review. As described previously [17], both were revised repeatedly after extensive pilot testing. The 65-page manual included explicit instructions on several types of adverse events. According to the manual, for example, surgical wound infections were almost invariably adverse events, as were falls and drug reactions that prolonged hospitalization or caused disability. A 14-page abstraction form first asked the physician reviewer to assess whether an adverse event might have occurred. If the physician found no possible adverse event, the review was stopped and the case received a score of 0. If an adverse event might have occurred, the reviewer considered a list of factors bearing on the cause of the injury and rated his or her confidence in the occurrence of an adverse event on an interval scale of 1 to 6 (Figure 1). For a confidence score of 2 (slight to modest evidence of an adverse event) or greater, the reviewer indicated the type of event (fall, drug reaction, wound infection, error of omission, or failure to diagnose), the number of additional days of hospitalization (if applicable), and the degree of disability over and above that caused by the underlying disease. Finally, the reviewers considered whether any error amounted to negligence. Within this structure, however, the physician could exercise discretion in judging the cause of the injury, hospitalization, or disability (a structured implicit review).
All physician reviewers identified themselves by number, with the understanding that their confidential opinions would not be used for quality assurance, peer review, or litigation. Copies of the abstraction booklet are available from the authors.

Figure 1.

Our report focuses on the two independent expert opinions obtained during stage 2 review as to whether an adverse outcome identified during stage 1 had been caused at least in part by medical management. The results of each assessment of causation were linked to the patient's computerized discharge data summary to identify the patient's age, diagnosis, and discharge status.

Statistical Analysis

Agreement between Reviewers

We calculated a rate of agreement between the two physician reviewers in each pair on adverse events using a statistic described by Grant [18] for assessing agreement on abnormal tracings from electronic fetal monitoring. In our application, the numerator of this statistic was the number of cases in which both reviewers assessed their confidence in an adverse event as more likely than not or greater, corresponding to a score of 4, 5, or 6. The denominator was the sum of the numerator and the number of cases of extreme disagreement, in which one reviewer scored the case as 4, 5, or 6 and the other found no possible adverse event (a score of 0). The statistic therefore compared the number of cases with agreed-upon adverse events with the number of clear disagreements. It does not include cases in which both physicians agreed that no adverse event had occurred; it recognizes that agreement about whether a patient's condition is normal (no adverse event) is usually greater than agreement about whether a patient has a disease or an abnormal condition [19-22]. The statistic is also unaffected by the number of clearly normal cases in the samples of cases for review.
In our study, the number of cases clearly without adverse events at stage 2 was influenced by the coarseness of the earlier screening process: the stage 1 reviewers were cautioned, when in doubt, to avoid false-negative determinations so that adverse events would not be overlooked. This statistic also facilitated comparisons of rates of agreement across subsets of adverse events such as drug reactions, which are
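The agreement statistic and its exact binomial confidence intervals described above can be sketched in code. The following is an illustrative implementation, not part of the original study's analysis software: the function names are ours, the score thresholds (0 for "no possible adverse event", 4 or greater for "more likely than not") follow the definitions in the text, and the Clopper-Pearson interval is computed by bisection on the binomial distribution function.

```python
from math import comb

def classify(score_1, score_2):
    """Classify one pair of reviewer confidence scores (0-6)."""
    hi_1, hi_2 = score_1 >= 4, score_2 >= 4
    if hi_1 and hi_2:
        return "agree"    # both rated an adverse event as more likely than not
    if (hi_1 and score_2 == 0) or (hi_2 and score_1 == 0):
        return "extreme"  # one found an event, the other no possible event
    return "excluded"     # intermediate combinations do not enter the statistic

def agreement_rate(score_pairs):
    """Grant-style rate of agreement b/(a + b); returns (rate, b, a + b)."""
    labels = [classify(s1, s2) for s1, s2 in score_pairs]
    b = labels.count("agree")    # agreed-upon adverse events
    a = labels.count("extreme")  # cases of extreme disagreement
    return b / (a + b), b, a + b

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def exact_binomial_ci(b, n, alpha=0.05):
    """Clopper-Pearson exact CI for b successes in n trials, by bisection."""
    def invert(k, target):
        # Find p such that binom_cdf(k, n, p) == target (CDF decreases in p).
        lo, hi = 0.0, 1.0
        for _ in range(200):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if b == 0 else invert(b - 1, 1 - alpha / 2)
    upper = 1.0 if b == n else invert(b, alpha / 2)
    return lower, upper
```

For example, 7 agreed-upon adverse events against 3 extreme disagreements give a rate of agreement of 0.70, with an exact 95% CI of roughly 0.35 to 0.93; the wide interval at such small denominators is what the exact binomial bars in the figure convey.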

[1]  D. Nalin Adverse Events Associated with Childhood Vaccines: Evidence Bearing on Causality , 1994, The Yale Journal of Biology and Medicine.

[2]  E. Keeler,et al.  The effects of the DRG-based prospective payment system on quality of care for hospitalized medicare patients : Final Report , 1994 .

[3]  B. Everitt,et al.  Statistical methods for rates and proportions , 1973 .

[4]  J. Bader,et al.  Agreement Among Dentists' Recommendations for Restorative Treatment , 1993, Journal of dental research.

[5]  G. Danan,et al.  Causality assessment of adverse reactions to drugs--I. A novel method based on the conclusions of international consensus meetings: application to drug-induced liver injuries. , 1993, Journal of clinical epidemiology.

[6]  R. Logan,et al.  Clinical Epidemiology: A Basic Science for Clinical Medicine , 1992 .

[7]  K. Kahn,et al.  Physician ratings of appropriate indications for three procedures: theoretical indications vs indications used in practice. , 1989, American journal of public health.

[8]  L. Tancredi,et al.  Obstetrics and malpractice. Evidence on the performance of a selective no-fault system. , 1991, JAMA.

[9]  J. M. Grant,et al.  The fetal heart rate trace is normal, isn't it? Observer agreement of categorical assessments , 1991, The Lancet.

[10]  J. Carlin,et al.  Bias, prevalence and kappa. , 1993, Journal of clinical epidemiology.

[11]  R. Penchansky,et al.  Targeting ambulatory care cases for risk management and quality management. , 1994, Inquiry : a journal of medical care organization, provision and financing.

[12]  R M Centor,et al.  Evaluating Physicians' Probabilistic Judgments , 1988, Medical decision making : an international journal of the Society for Medical Decision Making.

[13]  J. Elmore,et al.  A bibliography of publications on observer variability (final installment). , 1992, Journal of clinical epidemiology.

[14]  K L Posner,et al.  Effect of outcome on physician judgments of appropriateness of care. , 1991, JAMA.

[15]  A M Bernard,et al.  Evaluating the Care of General Medicine Inpatients: How Good Is Implicit Review? , 1993, Annals of Internal Medicine.

[16]  R. Brook,et al.  Diagnosis and Treatment of Coronary Disease , 1988 .

[17]  E K Harris,et al.  Use of the population distribution to improve estimation of individual means in epidemiological studies. , 1979, Journal of chronic diseases.

[18]  C. E. Davis,et al.  Empirical Bayes estimates of subgroup effects in clinical trials. , 1990, Controlled clinical trials.

[19]  L L Leape,et al.  The economic consequences of medical injuries. Implications for a no-fault insurance plan. , 1992, JAMA.

[20]  L. Chambless,et al.  A comparison of direct adjustment and regression adjustment of epidemiologic measures. , 1985, Journal of chronic diseases.

[21]  The Oversight of Medical Care: A Proposal for Reform , 1994, Annals of Internal Medicine.

[22]  D. Cowper,et al.  The Ratio of Observed‐to‐Expected Mortality as a Quality of Care Indicator in Non‐Surgical VA Patients , 1994, Medical care.

[23]  R. Wolfinger,et al.  Generalized linear mixed models a pseudo-likelihood approach , 1993 .

[24]  A. Feinstein,et al.  High agreement but low kappa: II. Resolving the paradoxes. , 1990, Journal of clinical epidemiology.

[25]  N M Laird,et al.  Hospital characteristics associated with adverse events and substandard care. , 1991, JAMA.

[26]  Douglas G. Altman,et al.  Practical statistics for medical research , 1990 .

[27]  L. Tancredi,et al.  "Medical Adversity Insurance"--a no-fault approach to medical malpractice and quality assurance. , 1973, The Milbank Memorial Fund quarterly. Health and society.

[28]  David A. Lane,et al.  Causal propositions in clinical research and practice. , 1992, Journal of clinical epidemiology.

[29]  L Lasagna,et al.  Adverse drug reactions—a matter of opinion , 1976, Clinical pharmacology and therapeutics.

[30]  R. Brook,et al.  Appropriateness of Care: A Comparison of Global and Outcome Methods to Set Standards , 1992 .

[31]  N M Laird,et al.  Relation between malpractice claims and adverse events due to negligence. Results of the Harvard Medical Practice Study III. , 1991, The New England journal of medicine.

[32]  L. Daly,et al.  Simple SAS macros for the calculation of exact binomial and Poisson confidence limits. , 1992, Computers in biology and medicine.

[33]  P. Graham,et al.  The analysis of ordinal agreement data: beyond weighted kappa. , 1993, Journal of clinical epidemiology.

[34]  L. Fielding,et al.  Identification of preventable trauma deaths: confounded inquiries? , 1992, The Journal of trauma.

[35]  S D Small,et al.  Incidence of adverse drug events and potential adverse drug events. Implications for prevention. ADE Prevention Study Group. , 1995, JAMA.

[36]  B. Fischhoff,et al.  Calibration of probabilities: the state of the art to 1980 , 1982 .

[37]  K. Kahn,et al.  Physician ratings of appropriate indications for six medical and surgical procedures. , 1986, American journal of public health.

[38]  Baruch Fischhoff,et al.  Calibration of Probabilities: The State of the Art , 1977 .

[39]  G. Casella An Introduction to Empirical Bayes Data Analysis , 1985 .

[40]  A. Flahault,et al.  Causality assessment of adverse reactions to drugs--II. An original model for validation of drug causality assessment methods: case reports with positive rechallenge. , 1993, Journal of clinical epidemiology.

[41]  H. D. de Vet,et al.  Sources of interobserver variation in histopathological grading of cervical dysplasia. , 1992, Journal of clinical epidemiology.

[42]  R H Brook,et al.  Preventable deaths: who, how often, and why? , 1988, Annals of internal medicine.

[43]  D. Musch,et al.  Some factors influencing interobserver variation in classifying simple pneumoconiosis. , 1985, British journal of industrial medicine.

[44]  T. Brennan,et al.  The nature of adverse events in hospitalized patients. Results of the Harvard Medical Practice Study II. , 1991, The New England journal of medicine.

[45]  R M Poses,et al.  The answer to “What are my chances, Doctor?” depends on whom is asked: Prognostic disagreement and inaccuracy for critically ill patients , 1989, Critical care medicine.

[46]  A R Feinstein,et al.  An algorithm for the operational assessment of adverse drug reactions. I. Background, description, and instructions for use. , 1979, JAMA.

[47]  S. Schroeder,et al.  Do bad outcomes mean substandard care? , 1991, JAMA.

[48]  D Draper,et al.  Changes in quality of care for five diseases measured by implicit review, 1981 to 1986. , 1990, JAMA.

[49]  C. Naranjo,et al.  Advances in the Diagnosis of Adverse Drug Reactions , 1992, Journal of clinical pharmacology.

[50]  R H Brook,et al.  Watching the doctor-watchers. How well do peer review organization methods detect hospital care quality problems? , 1992, JAMA.

[51]  F M Richardson,et al.  Peer Review of Medical Care , 1972, Medical care.

[52]  F. T. de Dombal,et al.  Radiological signs of ulcerative colitis: assessment of their reliability by means of observer variation studies. , 1968, Gut.

[53]  T. Brennan,et al.  Incidence of adverse events and negligence in hospitalized patients , 2008 .

[54]  A R Feinstein,et al.  A bibliography of publications on observer variability. , 1985, Journal of chronic diseases.

[55]  E. Mackenzie,et al.  Inter-rater reliability of preventable death judgments. The Preventable Death Study Group. , 1992, The Journal of trauma.

[56]  J. Fleming Is There a Future for Tort , 1984 .

[57]  R. Goldman,et al.  The reliability of peer assessments of quality of care. , 1992, JAMA.

[58]  L. Koran,et al.  The reliability of clinical methods, data and judgments (second of two parts). , 1975, The New England journal of medicine.

[59]  M. Kramer Difficulties in assessing the adverse effects of drugs. , 1981, British journal of clinical pharmacology.

[60]  T. Brennan,et al.  Incidence of adverse events and negligence in hospitalized patients. , 1991, The New England journal of medicine.

[61]  S. Dippe,et al.  A peer review of a peer review organization. , 1989, The Western journal of medicine.

[62]  T. Hutchinson,et al.  An algorithm for the operational assessment of adverse drug reactions. III. Results of tests among clinicians. , 1979, JAMA.