Systematic Reviews of Diagnostic Test Accuracy

Diagnosis is a critical component of health care, and clinicians, policymakers, and patients routinely face a range of questions regarding diagnostic tests. They want to know whether testing improves outcome; what test to use, purchase, or recommend in practice guidelines; and how to interpret test results. Well-designed diagnostic test accuracy studies can help in making these decisions, provided that they transparently and fully report their participants, tests, methods, and results, as facilitated, for example, by the STARD (Standards for Reporting of Diagnostic Accuracy) statement (1). That 25-item checklist was published in many journals and has now been adopted by more than 200 scientific journals worldwide.

As in other areas of science, systematic reviews and meta-analyses of accuracy studies can be used to obtain more precise estimates when several small studies address the same test and patients in the same setting. Reviews can also be useful to establish whether and how scientific findings vary by particular subgroups, and may provide summary estimates with stronger generalizability than estimates from a single study. Systematic reviews may help identify the risk for bias that may be present in the original studies and can be used to address questions that were not directly considered in the primary studies, such as comparisons between tests.

The Cochrane Collaboration is the largest international organization preparing, maintaining, and promoting systematic reviews to help people make well-informed decisions about health care (2). In 2003, the Collaboration decided to make preparations for including systematic reviews of diagnostic test accuracy in its Cochrane Database of Systematic Reviews. To enable this, a working group (Appendix) was formed to develop methodology, software, and a handbook. The first diagnostic test accuracy review was published in the Cochrane Database in October 2008.
In this paper, we review recent methodological developments concerning problem formulation, location of literature, quality assessment, and meta-analysis of diagnostic accuracy studies, drawing on our experience from the work on the Cochrane Handbook. The information presented here is based on the recent literature and updates previously published guidelines by Irwig and colleagues (3).

Definition of the Objectives of the Review

Diagnostic test accuracy refers to the ability of a test to distinguish between patients with disease (or, more generally, a specified target condition) and those without. In a study of test accuracy, the results of the test under evaluation, the index test, are compared with those of the reference standard determined in the same patients. The reference standard is an agreed-on and accurate method for identifying patients who have the target condition. Test results are typically categorized as positive or negative for the target condition. With such binary test outcomes, accuracy is most often expressed as the test's sensitivity (the proportion of patients with positive results on the reference standard who are also positive on the index test) and specificity (the proportion of patients with negative results on the reference standard who are also negative on the index test). Other measures have been proposed and are in use (4–6).

It has long been recognized that test accuracy is not a fixed property of a test. It can vary between patient subgroups, with their spectrum of disease, with the clinical setting, or with the test interpreters, and may depend on the results of previous testing. For this reason, inclusion of these elements in the study question is essential. To support a policy decision to promote use of a new index test, evidence is required that using the new test increases test accuracy over other testing options, including current practice, or that the new test has equivalent accuracy but offers other advantages (7–9).
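With binary results, these definitions reduce to simple proportions from the 2×2 cross-classification of index test against reference standard. A minimal sketch in Python, using invented cell counts for illustration only:

```python
# Sensitivity and specificity from a 2x2 table of index test results
# cross-classified against the reference standard.
# The counts used below are hypothetical, for illustration only.

def accuracy_measures(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from 2x2 cell counts.

    tp: index test positive, reference standard positive
    fp: index test positive, reference standard negative
    fn: index test negative, reference standard positive
    tn: index test negative, reference standard negative
    """
    sensitivity = tp / (tp + fn)  # proportion of diseased correctly detected
    specificity = tn / (tn + fp)  # proportion of non-diseased correctly excluded
    return sensitivity, specificity

# Hypothetical study: 100 patients with the target condition, 400 without.
sens, spec = accuracy_measures(tp=80, fp=40, fn=20, tn=360)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
# sensitivity = 0.80, specificity = 0.90
```

Note that both measures are conditional on the reference standard result, which is why, as the text stresses, they can still vary with the patient mix and setting in which the test is applied.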
As with the evaluation of interventions, systematic reviews need to include comparative analyses between alternative testing strategies and should not focus solely on evaluating the performance of a test in isolation. In relation to the existing situation, 3 possible roles for a new test can be defined: replacement, triage, and add-on (7). If a new test is to replace an existing test, then comparing the accuracy of both tests in the same population and against the same reference standard provides the most direct evidence. In triage, the new test is used before the existing test or testing pathway, and only patients with a particular result on the triage test continue the testing pathway. When a test is needed to rule out disease in patients who then need no further testing, a test that gives a minimal proportion of false-negative results, and thus a relatively high sensitivity, should be used. Triage tests may be less accurate than existing ones, but they have other advantages, such as simplicity or low cost. A third possible role of a new test is add-on. The new test is then positioned after the existing testing pathway to identify false-positive or false-negative results of that pathway. The review should provide data to assess the incremental change in accuracy made by adding the new test. An example of a replacement question can be found in a systematic review of the diagnostic accuracy of urinary markers for primary bladder cancer (10). Clinicians may use cytology to triage patients before they undergo invasive cystoscopy, the reference standard for bladder cancer. Because cytology combines high specificity with low sensitivity (11), the goal of the review was to identify a tumor marker with sufficient accuracy to either replace cytology or be used in addition to cytology. For a marker to replace cytology, it has to achieve equally high specificity with improved sensitivity.
New markers that are sensitive but not specific may have roles as adjuncts to conventional testing. The review included studies in which the test under evaluation (several different tumor markers and cytology) was evaluated against cystoscopy or histopathology. Included studies compared 1 or more of the markers, cytology only, or a combination of markers and cytology. Although information on accuracy can help clinicians make decisions about tests, good diagnostic accuracy is a desirable but not sufficient condition for the effectiveness of a test (8). To demonstrate that using a new test does more good than harm to patients tested, randomized trials of test-and-treatment strategies and reviews of such trials may be necessary. However, with the possible exception of screening, such randomized trials are in most cases not available, and systematic reviews of test accuracy may then provide the most useful evidence available to guide clinical and health policy decision making and to serve as input for decision and cost-effectiveness analyses (12).

Identification and Selection of Studies

Identifying test accuracy studies is more difficult than searching for randomized trials (13). There is no clear, unequivocal keyword or indexing term for an accuracy study in literature databases comparable with the term "randomized controlled trial." The Medical Subject Heading "sensitivity and specificity" may look suitable but is inconsistently applied in most electronic bibliographic databases. Furthermore, data on diagnostic test accuracy may be hidden in studies that did not have test accuracy estimation as their primary objective. This complicates the efficient identification of diagnostic test accuracy studies in electronic databases, such as MEDLINE. Until indexing systems properly code studies of test accuracy, searching for them will remain challenging and may require additional manual searches, such as screening reference lists.
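In practice, reviewers therefore build strategies from the test and condition concepts themselves, ORing synonyms (subject headings plus free-text words) within each concept and ANDing the concept blocks together. A minimal sketch, with illustrative terms loosely based on the bladder cancer tumor-marker question; the term lists and field tags here are examples, not a validated search strategy:

```python
# Sketch of a concept-based MEDLINE/PubMed search string: synonyms for one
# concept are ORed together; the concept blocks are then ANDed.
# All terms below are illustrative examples, not a validated strategy.

def or_block(terms):
    """Join synonyms for a single concept with OR, wrapped in parentheses."""
    return "(" + " OR ".join(terms) + ")"

# Target condition: subject headings plus free-text title/abstract words.
condition_terms = [
    "Urinary Bladder Neoplasms[MeSH]",
    "Hematuria[MeSH]",
    "bladder neoplasm*[tiab]",
    "bladder carcinoma*[tiab]",
    "transitional cell carcinoma*[tiab]",
    "hematuria[tiab]",
]

# Index test(s) under evaluation (hypothetical marker terms).
test_terms = [
    '"nuclear matrix protein 22"[tiab]',
    "NMP22[tiab]",
    '"tumor markers"[tiab]',
]

query = or_block(condition_terms) + " AND " + or_block(test_terms)
print(query)
```

Because, as noted above, indexing of accuracy studies is unreliable, a strategy like this is deliberately broad: it retrieves many irrelevant records, which are then excluded during screening rather than filtered out by methodological terms.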
In the development of a comprehensive search strategy, review authors can use search strings that refer to the test(s) under evaluation, the target condition, and the patient description, or a subset of these. For tests with a clear name that are used for a single purpose, searching for publications in which those tests are mentioned may suffice. For other reviews, adding the patient description may be necessary, although this element is also often poorly indexed. A search strategy in MEDLINE should contain both Medical Subject Headings and free-text words. A search strategy for articles about tests for bladder cancer, for example, should include as many synonyms for bladder cancer as possible, such as neoplasm, carcinoma, transitional cell, and hematuria. Several methodological electronic search filters for diagnostic test accuracy studies have been developed, each attempting to restrict the search to articles that are most likely to be test accuracy studies (13–16). These filters rely on indexing terms for research methodology and text words used in reporting results, but they often miss relevant studies and are unlikely to decrease the number of articles one needs to screen. Therefore, they are not recommended for systematic reviews (17, 18). The incremental value of searching in languages other than English and in the gray literature has not yet been fully investigated.

In systematic reviews of intervention studies, publication bias is an important and well-studied form of bias in which the decision to report and publish studies is linked to their findings. For clinical trials, the magnitude and determinants of publication bias have been identified by tracing the publication history of cohorts of trials reviewed by ethics committees and research boards (19). A consistent observation has been that studies with significant results are more likely to be published than studies with nonsignificant findings (19).
Investigating publication bias for diagnostic tests is problematic, because many studies are done without ethical review or study registration; therefore, identification of cohorts of studies from registration to final publication status is rarely feasible.

[1]  M. Weinstein,et al.  Decision Making in Health and Medicine , 2001 .

[2]  Lucas M Bachmann,et al.  Communicating accuracy of tests to general practitioners: a controlled study , 2002, BMJ : British Medical Journal.

[3]  M. Correale,et al.  Comparison of nuclear matrix protein 22 and bladder tumor antigen in urine of patients with bladder cancer. , 1998, Anticancer Research.

[4]  P. Bossuyt Interpreting diagnostic test accuracy studies. , 2008, Seminars in hematology.

[5]  Lucas M. Bachmann, et al.  Identifying Diagnostic Studies in MEDLINE: Reducing the Number Needed to Read , 2002, J. Am. Medical Informatics Assoc..

[6]  Xiao-Hua Zhou,et al.  Statistical Methods in Diagnostic Medicine , 2002 .

[7]  Milo Puhan,et al.  A Randomized Trial of Ways To Describe Test Accuracy: The Effect on Physicians' Post-Test Probability Estimates , 2005, Annals of Internal Medicine.

[8]  Johannes B Reitsma,et al.  Evidence of bias and variation in diagnostic accuracy studies , 2006, Canadian Medical Association Journal.

[9]  A. Agarwal,et al.  NMP22 is a sensitive, cost-effective test in patients at risk for bladder cancer. , 1999, The Journal of urology.

[10]  Petra Macaskill,et al.  Empirical Bayes estimates generated in a hierarchical summary ROC analysis agreed closely with those of a full Bayesian analysis. , 2004, Journal of clinical epidemiology.

[11]  R. Brian Haynes,et al.  Developing optimal search strategies for detecting clinically sound studies in MEDLINE. , 1994, Journal of the American Medical Informatics Association : JAMIA.

[12]  K. H. Lee Evaluation of the NMP22 test and comparison with voided urine cytology in the detection of bladder cancer. , 2001, Yonsei medical journal.

[13]  Patrick M M Bossuyt,et al.  Exploring sources of heterogeneity in systematic reviews of diagnostic tests , 2002, Statistics in medicine.

[14]  Lucas M Bachmann,et al.  Sample sizes of studies on diagnostic accuracy: literature survey , 2006, BMJ : British Medical Journal.

[15]  T. Gasser,et al.  Urinary level of nuclear matrix protein 22 in the diagnosis of bladder cancer: experience with 130 patients with biopsy confirmed tumor. , 2000, The Journal of urology.

[16]  C. Gatsonis,et al.  Designing studies to ensure that estimates of test accuracy are transferable , 2002, BMJ : British Medical Journal.

[17]  Marije Deutekom,et al.  Tumor markers in the diagnosis of primary bladder cancer. A systematic review. , 2003, The Journal of urology.

[18]  P. Bossuyt,et al.  Empirical evidence of design-related bias in studies of diagnostic tests. , 1999, JAMA.

[19]  J. Sterne,et al.  Accuracy of magnetic resonance imaging for the diagnosis of multiple sclerosis: systematic review , 2006, BMJ : British Medical Journal.

[20]  David Moher,et al.  Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Standards for Reporting of Diagnostic Accuracy. , 2003, Clinical chemistry.

[21]  C M Rutter,et al.  A hierarchical regression approach to meta‐analysis of diagnostic test accuracy evaluations , 2001, Statistics in medicine.

[22]  P. Glasziou,et al.  Identifying studies for systematic reviews of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic filters and the lack of information in the abstract. , 2005, Journal of clinical epidemiology.

[23]  M. Paoluzzi,et al.  Urinary dosage of nuclear matrix protein 22 (NMP22) like biologic marker of transitional cell carcinoma (TCC): a study on patients with hematuria. , 1999, Archivio italiano di urologia, andrologia : organo ufficiale [di] Societa italiana di ecografia urologica e nefrologica.

[24]  Sarah Lord,et al.  When Is Measuring Sensitivity and Specificity Sufficient To Evaluate a Diagnostic Test, and When Do We Need Randomized Trials? , 2006, Annals of Internal Medicine.

[25]  P. Bossuyt,et al.  Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. , 2006, Journal of clinical epidemiology.

[26]  Ben Ewald,et al.  Post hoc choice of cut points introduced bias to diagnostic research. , 2006, Journal of clinical epidemiology.

[27]  A. K. Agarwal,et al.  Exclusion criteria enhance the specificity and positive predictive value of NMP22 and BTA stat. , 1999, The Journal of urology.

[28]  P D Bezemer,et al.  Publications on diagnostic test evaluation in family medicine journals: an optimal search strategy. , 2000, Journal of clinical epidemiology.

[29]  B. Liu,et al.  Sensitivity and specificity of NMP-22, telomerase, and BTA in the detection of human bladder cancer. , 1999, Urology.

[30]  Alex J Sutton,et al.  Asymmetric funnel plots and publication bias in meta-analyses of diagnostic accuracy. , 2002, International journal of epidemiology.

[31]  Jonathan J Deeks,et al.  The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. , 2005, Journal of clinical epidemiology.

[32]  C. Gatsonis,et al.  Meta-analysis of diagnostic and screening test accuracy evaluations: methodologic primer. , 2006, AJR. American journal of roentgenology.

[33]  C Gatsonis,et al.  Meta‐analysis of Diagnostic Test Accuracy Assessment Studies with Varying Number of Thresholds , 2003, Biometrics.

[34]  J R Thornbury,et al.  Eugene W. Caldwell Lecture. Clinical efficacy of diagnostic imaging: love it or leave it. , 1994, AJR. American journal of roentgenology.

[35]  A. Browning,et al.  Evaluation of the Clinical Value of Urinary NMP22 as a Marker in the Screening and Surveillance of Transitional Cell Carcinoma of the Urinary Bladder , 2001, European Urology.

[36]  P. Bossuyt,et al.  Impact of adjustment for quality on results of metaanalyses of diagnostic accuracy. , 2007, Clinical chemistry.

[37]  P. Bossuyt,et al.  BMC Medical Research Methodology , 2002 .

[38]  P. Bossuyt,et al.  The quality of diagnostic accuracy studies since the STARD statement , 2006, Neurology.

[39]  Haitao Chu,et al.  A unification of models for meta-analysis of diagnostic accuracy studies. , 2009, Biostatistics.

[40]  R. Haynes, et al.  Optimal search strategies for retrieving scientifically strong studies of diagnosis from MEDLINE: analytical survey , 2004 .

[41]  C. Constantinides,et al.  Comparative evaluation of the diagnostic performance of the BTA stat test, NMP22 and urinary bladder cancer antigen for primary and recurrent bladder tumors. , 2001, The Journal of urology.

[42]  L E Moses,et al.  Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. , 1993, Statistics in medicine.

[43]  J. Sehouli, et al.  A likelihood ratio approach to meta-analysis of diagnostic studies , 2003 .

[44]  Ömer Öge,et al.  Evaluation of nuclear matrix protein 22 (NMP22) as a tumor marker in the detection of bladder cancer , 2004, International Urology and Nephrology.

[45]  Frank Buntinx,et al.  The evidence base of clinical diagnosis , 2008 .

[46]  H. Biri,et al.  Comparison of the Nuclear Matrix Protein 22 with Voided Urine Cytology and BTA stat Test in the Diagnosis of Transitional Cell Carcinoma of the Bladder , 1999, European Urology.

[47]  Penny Whiting, et al.  No role for quality scores in systematic reviews of diagnostic accuracy studies , 2005, BMC Medical Research Methodology.

[48]  K. Bichler,et al.  Comparison of cytology and nuclear matrix protein 22 for the detection and follow-up of bladder cancer. , 2001, Urologia internationalis.

[49]  Frederick Mosteller,et al.  Guidelines for Meta-analyses Evaluating Diagnostic Tests , 1994, Annals of Internal Medicine.

[50]  P. Bossuyt,et al.  Sources of Variation and Bias in Studies of Diagnostic Accuracy , 2004, Annals of Internal Medicine.

[51]  M. Leeflang,et al.  Bias in sensitivity and specificity caused by data-driven selection of optimal cutoff values: mechanisms, magnitude, and solutions. , 2008, Clinical chemistry.

[52]  K. Khan,et al.  Systematic reviews of diagnostic tests: a guide to methods and application. , 2005, Best practice & research. Clinical obstetrics & gynaecology.

[53]  J. Kleijnen,et al.  Systematic reviews to evaluate diagnostic tests. , 2001, European journal of obstetrics, gynecology, and reproductive biology.

[54]  Johannes B Reitsma,et al.  Quality of reporting of diagnostic accuracy studies. , 2005, Radiology.

[55]  J. C. Houwelingen,et al.  Bivariate Random Effects Meta-Analysis of ROC Curves , 2008, Medical decision making : an international journal of the Society for Medical Decision Making.

[56]  V. Lokeshwar,et al.  Urinary bladder tumor markers. , 2006, Urologic oncology.

[57]  D J O'Kane,et al.  Comparison of screening methods in the detection of bladder cancer. , 1999, The Journal of urology.

[58]  Patrick M Bossuyt,et al.  We should not pool diagnostic likelihood ratios in systematic reviews , 2008, Statistics in medicine.

[59]  Penny F Whiting,et al.  How does study quality affect the results of a diagnostic meta-analysis? , 2005, BMC medical research methodology.

[60]  Johannes B Reitsma,et al.  Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. , 2005, Journal of clinical epidemiology.

[61]  K Koiso,et al.  Urinary nuclear matrix protein 22 as a new marker for the screening of urothelial cancer in patients with microscopic hematuria , 1999, International journal of urology : official journal of the Japanese Urological Association.

[62]  Paul Glasziou,et al.  Comparative accuracy: assessing new tests against existing diagnostic pathways , 2006, BMJ : British Medical Journal.

[63]  David Moher,et al.  Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. , 2004, Family practice.

[64]  Alex J. Sutton,et al.  Publication and related biases: a review , 2000 .