Review of the medical record is an integral part of evaluating the quality and appropriateness of inpatient care (for example, by payers, hospitals, professional organizations, and researchers) [1-4]. Often the initial review of patients' charts is based on predetermined criteria (explicit review) and does not require a physician reviewer. However, the complexity and heterogeneity of care on general medicine services make it impractical to develop valid preset criteria for most aspects of care provided on these services [1, 2]. Therefore, most quality and utilization review for general medical inpatients relies on the opinions of peers as to whether they believe the care was appropriate (implicit review). Structured implicit review is a process whereby an expert reviewer judges specific aspects of patient care [1, 3, 5]. By serially focusing the reviewer's attention on important aspects of care and by obtaining implicit judgments about this care, the reliability and validity of the review are improved. This process was pioneered by Butler and Quinlan [6] and has been refined by researchers at the RAND Corporation [3, 5, 7] for specific medical diseases. Concurrent with our study, Rubin and colleagues [8] independently developed a structured implicit review instrument for diverse medical and surgical conditions.

Implicit review by peers (structured or not) is generally considered the community standard for final quality decisions [3, 4]. However, given that practice norms vary widely, can peers agree on the bounds of appropriate care? A recent review of the literature [9] found a paucity of empiric information about the reliability of peer judgments, and little has been reported on the reliability of peers' judgments about most aspects of care for general medicine inpatients (for example, specific quality problems and appropriateness of resource use). We evaluated implicit review for measuring quality of care and appropriateness of resource use on general medicine wards. We determined how often poor care was identified, the types of quality problems reported, and the level of agreement between different physician reviewers (inter-rater reliability).

Methods

Patients

All patients were on one of four general medicine services at a large university teaching hospital. Each ward team consisted of an attending physician, a resident, and two or three interns, who rotated on the service for 1 month. The patient mix was heterogeneous; no single diagnosis-related group contributed more than 5% of patient admissions. A review of a random sample of admissions occurring from January 1988 to June 1990 was supplemented with oversampling of patient groups of particular interest: deaths, patients readmitted within 28 days of discharge, and patients whose APACHE-L score (a severity-of-illness measure composed of the laboratory section of the original APACHE score) [10] increased during hospitalization. We also sampled patients for whom hospital satisfaction data were available (from a survey of consecutive patients discharged alive during a 6-month period). Overall, charts from 675 patient admissions were reviewed (5% of the charts selected could not be obtained): 425 admissions were selected randomly with replacement, and 250 admissions were selected randomly to oversample deaths, early readmissions, in-hospital increases in APACHE-L scores, and the patient satisfaction survey.
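In rough outline, the two-part sample described above could be drawn as in the sketch below. This is a minimal illustration under assumed data structures: the admission fields (died, readmitted_28d, apache_l_increased, satisfaction_survey), the function name, and the decision to draw the oversampled stratum without replacement are our assumptions, not details taken from the study.

```python
import random

def draw_study_sample(admissions, n_random=425, n_oversample=250, seed=0):
    """Sketch of the two-part sampling scheme: a simple random sample drawn
    with replacement, plus oversampling of admissions of special interest."""
    rng = random.Random(seed)

    # Part 1: random sample of admissions, drawn with replacement,
    # so the same admission can be selected more than once.
    random_part = [rng.choice(admissions) for _ in range(n_random)]

    # Part 2: oversample deaths, 28-day readmissions, in-hospital increases
    # in the APACHE-L score, and the patient-satisfaction cohort.
    # (These field names are illustrative, not the study's variables.)
    special = [a for a in admissions
               if a.get("died")
               or a.get("readmitted_28d")
               or a.get("apache_l_increased")
               or a.get("satisfaction_survey")]
    oversample_part = rng.sample(special, min(n_oversample, len(special)))

    return random_part + oversample_part
```

Drawing the first stratum with replacement matches the description of the 425 randomly selected admissions.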
This sampling strategy was designed to allow detailed analysis of patient groups of special interest for an evaluation of quality screens and of the impact of a clinical management intervention. Approximately 20% of the sampled admissions (n = 171) were randomly selected for multiple independent reviews to allow reliability testing. Although reviewers knew that the completeness and reliability of their reviews were being tested, they did not know which patients' charts had been selected for multiple reviews. The 12 reviewers worked varying numbers of hours on the study and are not equally represented in the reliability analyses; however, all but one reviewer reviewed between 25 and 45 of the 171 admission charts selected for multiple reviews. The remaining reviewer dropped out of the study early and reviewed only nine of the reliability charts; excluding this reviewer from the analyses would not affect our results. The patients in the 171 admissions selected for inter-rater reliability testing had a mean age of 52 years (SD, 20 years); 20% died in the hospital (as noted above, deaths were oversampled); and the admissions came from 39 different diagnosis-related groups (no more than 12 admissions were in any one group). The median number of reviewers per case was three.

Structured Implicit Review Instrument

Using the knowledge gained from structured implicit review for specific medical diseases, we developed a structured implicit review instrument to evaluate quality of care on a general medicine ward. The instrument included 10 questions about specific aspects of the quality of inpatient care, 5 questions about the appropriateness of resource use, and 3 questions about outpatient care before admission. For patients who died in the hospital, reviewers were asked whether death could have been prevented by better quality of care. The overall evaluation of quality of medical care was rated on a 6-point scale (1 = superior, 2 = excellent, 3 = good, 4 = adequate, 5 = substandard, 6 = poor). Most of the other items were measured on 5-point scales (for example, 1 = definitely adequate, 2 = probably adequate, 3 = unsure, 4 = probably not adequate, 5 = definitely not adequate). For all measures, more appropriate care was represented by a lower scale score. When reviewers were unsure, they were asked to explain the reason for their uncertainty (borderline medical care, illegible handwriting, inadequate documentation in the chart, or other).
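To fix ideas, a completed review might be encoded along the lines sketched below. The item names are placeholders, and the dichotomization in flags_substandard (treating an overall rating of 5 or 6 as a quality problem) is an assumption made for illustration, not a threshold specified by the instrument.

```python
from dataclasses import dataclass, field

# Overall quality is rated 1 (superior) through 6 (poor); most other items
# use a 5-point adequacy scale, 1 (definitely adequate) through
# 5 (definitely not adequate).  Lower scores always indicate better care.
OVERALL_QUALITY = {1: "superior", 2: "excellent", 3: "good",
                   4: "adequate", 5: "substandard", 6: "poor"}
ADEQUACY = {1: "definitely adequate", 2: "probably adequate", 3: "unsure",
            4: "probably not adequate", 5: "definitely not adequate"}

@dataclass
class ChartReview:
    reviewer_id: str
    admission_id: str
    overall_quality: int                              # 1-6 scale above
    item_ratings: dict = field(default_factory=dict)  # item name -> 1-5 rating
    uncertainty_reason: str = ""                      # recorded when "unsure"

    def flags_substandard(self, cutoff=5):
        # Illustrative dichotomization: an overall rating of 5 (substandard)
        # or 6 (poor) counts as a reported quality problem.
        return self.overall_quality >= cutoff
```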
Reviews were done by 12 board-certified internists (7 fellows and 5 faculty) who had reputations at the study hospital for being excellent clinicians. All reviewers had recent or current extensive experience in general medicine inpatient care and had trained in diverse settings; only four had trained at the study hospital. All reviewers received 15 to 20 hours of training before beginning chart review. In a 90-minute initial instruction session, we reviewed the abstracting form and the written instructions for completing the implicit review. Reviewers were given examples of what we considered substantial and unimportant deviations from the standard of care for each item on the review form. We discussed the possible effect of poor outcomes on a reviewer's judgment [12]. Reviewers were instructed not to second-guess reasonable judgments using hindsight and to concentrate on the quality of the process of care. They were asked to evaluate whether they believed the care received was appropriate, regardless of whether the patient had a good or bad outcome. We were interested only in clinically important quality problems that put the patient at substantial risk for a poor outcome. Each reviewer was required to give a written description of any alleged quality problem and to record the specific risk to the patient. Reviewers were not to consider inefficient care or unnecessary use of hospital resources in their evaluation of overall quality unless it resulted in undue patient risk (for example, an unnecessary angiogram).

After the initial training session, all reviewers were assigned to review the same 15 to 20 pilot charts. Small group meetings, each with three or four reviewers, were then held to discuss their ratings of these charts for each category of the chart-review instrument, and the instructions given in the original meeting were reiterated in detail. Preliminary assessments of each reviewer's reliability and thoroughness were then made, and 4 of the 12 reviewer trainees were assigned additional pilot charts for further training. Once chart review had begun in earnest, reviewers could still direct questions about the review form to the chart review supervisor (RAH), who examined reviewers' written comments and abstract forms for inconsistencies or incomplete data and contacted reviewers when necessary. Every 2 months, the supervisor contacted all chart reviewers, asked whether they were having any problems with the reviews, and reinforced training (by reviewing the definitions of the review instrument subcategories and reiterating the training instructions). Copies of the review instrument and the instructions to reviewers are available on request.

Statistical Analysis

Weighted kappa statistics were calculated to quantify the inter-rater reliability of the reviewers' implicit judgments. Kappas were calculated on the 171 charts that had multiple reviews, producing 368 separate comparisons. In evaluating inter-rater reliability, it is important to distinguish between the reliability necessary for assessment of an individual patient and that needed for assessment of an aggregate of patients. For most aggregate comparisons using means or proportions (such as comparing mean scores on two hospital wards), moderate reliabilities of 0.5 or 0.6 are usually acceptable [11]. In contrast, reliability must usually be high (greater than 0.80) to make a confident assessment of an individual patient [11]. However, the kappa statistic is only a summary of the amount of agreement and does not directly convey the predictiveness of a judgment. Therefore, we calculated the sensitivity and specificity of a single review, using the mean rating of the other reviewers as the standard (a method described by Rubin and colleagues [8]). This analysis estimates the likelihood that a single review would correctly classify the patient compared with multiple other reviews (cutoff point for substandard care was a mean s
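To make these calculations concrete, the sketch below computes a weighted kappa for two reviewers' ordinal ratings and then estimates the sensitivity and specificity of a single review against the mean of the remaining reviews of the same chart, in the spirit of the approach of Rubin and colleagues [8]. The example data, the linear disagreement weights, and the 4.5 cutoff for substandard care are illustrative assumptions, not the study's actual choices.

```python
from collections import Counter

def weighted_kappa(r1, r2, categories, weights="linear"):
    """Weighted kappa for two raters' ordinal ratings of the same charts."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)

    # Disagreement weight grows with the distance between categories.
    def w(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d ** 2   # quadratic otherwise

    # Observed weighted disagreement across the n rated charts.
    observed = sum(w(idx[a], idx[b]) for a, b in zip(r1, r2)) / n

    # Disagreement expected by chance from the two raters' marginals.
    m1, m2 = Counter(idx[a] for a in r1), Counter(idx[b] for b in r2)
    expected = sum(w(i, j) * m1[i] * m2[j]
                   for i in range(k) for j in range(k)) / n ** 2

    return 1.0 - observed / expected


def single_review_accuracy(reviews_per_chart, cutoff=4.5):
    """Sensitivity and specificity of one review against the mean of the other
    reviews of the same chart (the 4.5 cutoff is an assumed example value)."""
    tp = fp = tn = fn = 0
    for ratings in reviews_per_chart:            # one list of ratings per chart
        for i, single in enumerate(ratings):
            others = ratings[:i] + ratings[i + 1:]
            if not others:
                continue
            standard_bad = sum(others) / len(others) > cutoff
            single_bad = single > cutoff
            if standard_bad and single_bad:
                tp += 1
            elif standard_bad:
                fn += 1
            elif single_bad:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Example with made-up overall-quality ratings (1 = superior ... 6 = poor).
print(weighted_kappa([2, 3, 5, 4, 6], [2, 4, 5, 3, 6], categories=[1, 2, 3, 4, 5, 6]))
print(single_review_accuracy([[2, 3, 2], [5, 6, 4], [4, 5, 5]]))
```

Comparing each review with the mean of the others tests how well a single reviewer's call predicts the more stable standard formed by multiple reviewers.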
References

[1] Quality in health care. JAMA, 1992.
[2] M. A. Spiteri, et al. Reliability of eliciting physical signs in examination of the chest. The Lancet, 1988.
[3] H. Pohl, et al. Medication prescribing errors in a teaching hospital. JAMA, 1990.
[4] K. N. Lohr, et al. Medicare: A Strategy for Quality Assurance. Journal of Quality Assurance, 1991.
[5] R. Brook, et al. Hospital inpatient mortality: is it a predictor of quality? The New England Journal of Medicine, 1987.
[6] P. Dans, et al. Peer review organizations: promises and potential pitfalls. The New England Journal of Medicine, 1985.
[7] K. Lohr, et al. Monitoring quality of care in the Medicare program: two proposed systems. JAMA, 1987.
[8] A. Donabedian, et al. The Criteria and Standards of Quality. 1980.
[9] L. McMahon, et al. APACHE-L: a new severity of illness adjuster for inpatient medical care. Medical Care, 1992.
[10] D. Draper, et al. Changes in quality of care for five diseases measured by implicit review, 1981 to 1986. JAMA, 1990.
[11] J. Butler, et al. Internal audit in the department of medicine of a community hospital; two years' experience. Journal of the American Medical Association, 1958.
[12] K. Kahn, et al. Structured Implicit Review for Physician Implicit Measurement of Quality of Care. 1989.
[13] R. H. Brook, et al. Watching the doctor-watchers: how well do peer review organization methods detect hospital care quality problems? JAMA, 1992.
[14] R. Williams, et al. Inter-observer variation of symptoms and signs in jaundice. Liver, 2008.
[15] B. Hulka, et al. Peer review in ambulatory care: use of explicit criteria and implicit judgments. Medical Care, 1979.
[16] T. Delbanco, et al. Incidence and characteristics of preventable iatrogenic cardiac arrests. JAMA, 1991.
[17] K. L. Posner, et al. Effect of outcome on physician judgments of appropriateness of care. JAMA, 1991.
[18] T. A. Brennan, et al. Reliability and Validity of Judgments Concerning Adverse Events Suffered by Hospitalized Patients. Medical Care, 1989.
[19] A. Donabedian. The definition of quality and approaches to its assessment. 1980.
[20] R. H. Brook, et al. Preventable deaths: who, how often, and why? Annals of Internal Medicine, 1988.
[21] R. Goldman, et al. The reliability of peer assessments of quality of care. JAMA, 1992.