GRADE: assessing the quality of evidence for diagnostic recommendations

Making a diagnosis is the bread and butter of clinical practice, but in light of the number of tests now available to clinicians, diagnosing illness has become a complicated process. Guidelines for making an evidence-based diagnosis abound, but those making recommendations about diagnostic tests or test strategies must realize that clinicians require support to make diagnostic decisions that they can easily implement in daily practice. The Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group has developed a rigorous, transparent, and increasingly adopted approach for grading the quality of research evidence and the strength of recommendations to guide clinical practice. This editorial summarizes GRADE's process for developing recommendations about tests (1).

Clinicians are trained to use tests for screening and diagnosis; identifying physiologic derangements; establishing a prognosis; and monitoring illness and treatment response by assessing signs and symptoms, imaging, biochemistry, pathology, and psychological testing techniques (2). Sensitivity, specificity, positive predictive value, likelihood ratios, and diagnostic odds ratios are among the challenging terms that diagnostic studies typically deliver to clinicians, and all have to do with diagnostic accuracy. Not only do clinicians have difficulty remembering the definitions and calculations for these terms, but applying the concepts to individual patients is often complicated. Many clinicians order a test despite uncertainty about how to interpret the result, and they also contribute to testing errors by ordering tests incorrectly (3, 4).

GRADE's framework for developing recommendations about diagnostic tests and management strategies is based on what is needed for practical clinical application: how to weigh the benefits and harms of ordering and using a diagnostic test in caring for patients (1). The approach begins with specifying the PICO for a focused clinical question: the relevant population (P), the diagnostic intervention or test (I), including its purpose (e.g., triage, replacement, or add-on test), the comparison test (C), and the patient-important outcomes (O) related to use of the test. If a test fails to improve patient-important outcomes, there is no reason to use it, whatever its accuracy. For example, the results of genetic testing for Huntington chorea, an untreatable condition, may provide either welcome reassurance that a patient will not have the condition or the ability to plan for his future in the knowledge that he will, sadly, develop it (1). Here, the ability to plan is analogous to an effective treatment, and the benefits of planning need to be balanced against the downsides of receiving an early diagnosis (5-7).

The best evidence about test performance comes from large randomized trials of diagnostic strategies that directly measure patient-important outcomes (1). However, such trials are few and far between: an informal review of the Cochrane database of randomized trials revealed fewer than 100 such studies. Therefore, most recommendations about diagnostic testing are based on an implicit 2-step process linking the accuracy of a test indirectly to patient-important outcomes. In the first step, a diagnostic test accuracy study (Figure), patients receive both the new test and a reference test (i.e., the best available method for detecting the target condition), and investigators calculate the accuracy of the new test compared with the reference test.
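As a concrete illustration of this first step, all of the accuracy measures named above can be derived from the 2x2 cross-classification of a new test against the reference standard. The short Python sketch below uses purely hypothetical counts (the values of tp, fp, fn, and tn are assumptions chosen for illustration, not data from any study cited here) to show how sensitivity, specificity, predictive values, likelihood ratios, and the diagnostic odds ratio relate to one another.

```python
# Illustrative only: hypothetical 2x2 counts for a new test judged
# against a reference standard (not data from any study cited above).
tp, fp = 90, 30   # test positive: with / without the target condition
fn, tn = 10, 170  # test negative: with / without the target condition

sensitivity = tp / (tp + fn)               # proportion of diseased patients who test positive
specificity = tn / (tn + fp)               # proportion of non-diseased patients who test negative
ppv = tp / (tp + fp)                       # positive predictive value (prevalence-dependent)
npv = tn / (tn + fn)                       # negative predictive value (prevalence-dependent)
lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
dor = lr_pos / lr_neg                      # diagnostic odds ratio

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
print(f"PPV={ppv:.2f}, NPV={npv:.2f}, LR+={lr_pos:.2f}, LR-={lr_neg:.2f}, DOR={dor:.1f}")
```

Note that the predictive values, unlike sensitivity and specificity, depend on the prevalence of disease in the population actually tested, which is one reason why applying these measures to individual patients is not straightforward.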
In the second step, judgments about the patient importance of test accuracy are based on the consequences of being correctly or incorrectly classified as having or not having the disease. These consequences include the benefits and harms of receiving treatment or follow-up tests for those correctly classified as having the disease (true positives); reassurance or receipt of other follow-up tests for those correctly classified as not having the disease (true negatives); receipt of unnecessary treatment or additional tests for those incorrectly classified as having the disease (false positives); delayed or no treatment for those incorrectly classified as not having the disease (false negatives); and any adverse effects of the diagnostic test itself (e.g., from invasive tests). Those making recommendations about diagnostic tests must then compare patient-important outcomes (and costs) in all patients receiving the new test with those in all patients receiving the old, or comparator, test.

For the first step (i.e., assessing test accuracy), there are well-described methodological criteria for assessing the risk for bias in an estimate of test accuracy, ideally based on a systematic review of relevant studies. For instance, studies of diagnostic test accuracy with a low risk for bias enroll consecutive patients for whom there is legitimate diagnostic uncertainty: the type of patients to whom clinicians would apply the test in the course of regular clinical practice. If studies fail this criterion (e.g., they enroll only patients with severe disease and healthy controls), the apparent accuracy of the test is likely to be misleadingly high (8, 9).

The second step shown in the Figure is, in most situations, based on judgments that use test accuracy as a surrogate for patient-important outcomes. The key issue about these judgments is that they should be made transparent to those using the recommendations. For example, in the diagnosis of suspected acute urolithiasis, well-designed studies demonstrate fewer false-negative results with noncontrast helical computed tomography (CT) than with intravenous pyelography (IVP) (10). However, the ureteric stones that CT detects but IVP misses are smaller and therefore more likely to pass spontaneously. In the absence of randomized trials evaluating outcomes in patients treated for smaller stones, evidence from observational studies was of lower quality, so it remained uncertain how patients were affected by missed cases and by the follow-up of incidental CT findings unrelated to renal calculi. Recommendations about using one test (IVP) over the other (helical CT) were therefore based on judgments of how the cases that were detected or missed would fare with or without treatment (11). These judgments rest on indirect evidence and are less certain than judgments based on direct evidence from a randomized trial comparing the 2 tests.

The GRADE approach requires making these judgments about the relation between accuracy and patient-important outcomes transparent. The example of IVP versus helical CT for patients with suspected acute urolithiasis shows how the quality of evidence for an accurate test would be downgraded because of the lack of direct evidence on patient-important outcomes. Uncertainty about patient-important consequences, and the associated uncertainty about benefits and harms, would probably have resulted in weak GRADE recommendations about the use of IVP compared with helical CT.
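To make the second step concrete, the consequences of testing can be tabulated as expected numbers of true and false positives and negatives per 1000 patients tested, which is the kind of natural-frequency summary to which judgments about patient-important outcomes are then applied. The Python sketch below uses assumed prevalence and accuracy values (chosen for illustration only, not the pooled estimates from the urolithiasis studies cited in the text) to show how two tests with different accuracy translate into different downstream consequences.

```python
# Illustrative only: translate assumed accuracy into expected consequences
# per 1000 patients tested. The prevalence, sensitivity, and specificity
# figures below are assumptions for illustration, not estimates taken from
# the studies cited in the text.

def consequences_per_1000(sensitivity, specificity, prevalence, n=1000):
    """Return expected true/false positives and negatives among n tested patients."""
    diseased = n * prevalence
    healthy = n - diseased
    return {
        "true_positives": sensitivity * diseased,         # correctly detected, offered treatment
        "false_negatives": (1 - sensitivity) * diseased,  # missed, treatment delayed or withheld
        "true_negatives": specificity * healthy,          # correctly reassured
        "false_positives": (1 - specificity) * healthy,   # unnecessary work-up or treatment
    }

# Hypothetical comparison of a newer and an older test at 50% prevalence.
new_test = consequences_per_1000(sensitivity=0.95, specificity=0.96, prevalence=0.5)
old_test = consequences_per_1000(sensitivity=0.60, specificity=0.90, prevalence=0.5)

for outcome in new_test:
    print(f"{outcome}: new={new_test[outcome]:.0f}, old={old_test[outcome]:.0f} per 1000 tested")
```

Whether the extra cases detected by the more accurate test actually improve patient-important outcomes is exactly the judgment that, under GRADE, must be made explicit and, in the absence of direct trial evidence, may lead to downgrading the quality of evidence.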
Those making recommendations using the GRADE approach should also explicitly consider judgments and evidence about the values and preferences that patients attach to the important consequences, as described more fully elsewhere (1).

Acknowledgments: This work was partially funded by a European Commission grant, The Human Factor, Mobility and Marie Curie Actions Scientist Reintegration Grant (IGR 42192-GRADE), to Dr. Schünemann.

[1] J. B. Reitsma, et al. Evidence of bias and variation in diagnostic accuracy studies. Canadian Medical Association Journal, 2006.

[2] J. B. Martin. Huntington's disease. Neurology, 1984.

[3] A. Worster, et al. Does replacing intravenous pyelography with noncontrast helical computed tomography benefit patients with suspected acute urolithiasis? Canadian Association of Radiologists Journal, 2002.

[4] P. Bossuyt, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA, 1999.

[5] A. Haeringen, et al. Paradox of a better test for Huntington's disease. Journal of Neurology, Neurosurgery, and Psychiatry, 2000.

[6] S. Dovey, et al. Testing process errors and their harms and consequences reported from family medicine practices: a study of the American Academy of Family Physicians National Research Network. Quality & Safety in Health Care, 2008.

[7] J. J. Deeks, et al. Systematic reviews in health care: systematic reviews of evaluations of diagnostic and screening tests. BMJ, 2001.

[8] T. Gray, et al. Learning needs in clinical biochemistry for doctors in foundation years. Annals of Clinical Biochemistry, 2008.

[9] S. Wiggins, et al. Psychological consequences and predictors of adverse events in the first 5 years after predictive testing for Huntington's disease. Clinical Genetics, 2003.

[10] B. Weaver, et al. The accuracy of noncontrast helical computed tomography versus intravenous pyelography in the diagnosis of suspected acute urolithiasis: a meta-analysis. Annals of Emergency Medicine, 2002.

[11] H. J. Schünemann, et al. GRADE: grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ, 2008.