Diagnostic tests are of crucial importance in health care. They are performed to reduce uncertainty concerning whether a patient has a condition of interest. A thorough evaluation of diagnostic tests is necessary to ensure that only accurate tests are used in practice. Diagnostic accuracy studies are a vital step in this evaluation process. Diagnostic accuracy studies aim to investigate how well the results from a test being evaluated (index test) agree with the results of the reference standard. The reference standard is considered the best available method to establish the presence or absence of a condition (target condition). In a classic diagnostic accuracy study, a consecutive series of patients who are suspected of having the target condition undergo the index test; then, all patients are verified by the same reference standard. The index test and reference standard are then read by persons blinded to the results of each, and various measures of agreement are calculated (for example, sensitivity, specificity, likelihood ratios, and diagnostic odds ratios).

This classic design has many variations, including differences in the way patients are selected for the study, in test protocol, in the verification of patients, and in the way the index test and reference standard are read. Some of these differences may bias the results of a study, whereas others may limit the applicability of results. Bias is said to be present in a study if distortion is introduced as a consequence of defects in its design or conduct. Therefore, a biased diagnostic accuracy study will produce estimates of test performance that differ from the true performance of the test. In contrast, variability arises from differences among studies, for example, in terms of population, setting, test protocol, or definition of the target disorder (1). Although variability does not lead to biased estimates of test performance, it may limit the applicability of results and thus is an important consideration when evaluating studies of diagnostic accuracy.

The distinction between bias and variation is not always straightforward, and the use of different definitions in the literature further complicates this issue. For example, when a diagnostic study starts by including patients who have already received a diagnosis of the target condition and uses a group of healthy volunteers as the control group, it is likely that both sensitivity and specificity will be higher than they would be in a study made up of patients only suspected of having the target condition. This feature has been described as spectrum bias. However, strictly speaking, one could argue that it is a form of variability; sensitivity and specificity have been measured correctly within the study and thus there is no bias, but the results cannot be applied to the clinical setting. In other words, they lack generalizability (2). Others have argued that when the goal of a study is to measure the accuracy of a test in the clinical setting, an error in the method of patient selection is made that will lead to biased estimates of test performance. They use a broader definition of bias and take into account the underlying research question when deciding whether results are biased. In this paper, we use a more restricted definition of bias.

Our goal is to classify the various sources of variation and bias, describe their effects on test results, and provide a summary of the available evidence that supports each source of bias and variation (Table 1).
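To make these agreement measures concrete, the following minimal sketch (in Python, with invented counts) computes sensitivity, specificity, likelihood ratios, and the diagnostic odds ratio from the four cells of the 2x2 table that cross-classifies index test results against the reference standard.

# Minimal sketch: common accuracy measures from a 2x2 table of index test
# results versus the reference standard. All counts are hypothetical.

def accuracy_measures(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)               # diseased patients with a positive index test
    specificity = tn / (tn + fp)               # non-diseased patients with a negative index test
    lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
    dor = lr_pos / lr_neg                      # diagnostic odds ratio, equal to (tp*tn)/(fp*fn)
    return sensitivity, specificity, lr_pos, lr_neg, dor

# Hypothetical study: 100 patients with and 200 without the target condition.
sens, spec, lrp, lrn, dor = accuracy_measures(tp=85, fp=20, fn=15, tn=180)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} LR+={lrp:.1f} LR-={lrn:.2f} DOR={dor:.0f}")

A completely uninformative test has a diagnostic odds ratio of 1; the hypothetical counts above yield a ratio of 51.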
For this purpose, we conducted a systematic review of all studies in which the main focus was to examine the effects of one or more sources of bias or variation on estimates of test performance.

Table 1. Description of Sources of Bias and Variation

Methods

Literature Searches

We searched MEDLINE, EMBASE, BIOSIS, and the methodologic databases of the Centre for Reviews and Dissemination and the Cochrane Collaboration from database inception to 2001. Search terms included sensitivit*, mass-screening, diagnostic-test, laboratory-diagnosis, false positive*, false negative*, specificit*, screening, accuracy, predictive value*, reference value*, likelihood ratio*, sroc, and receiver operat* characteristic*. We also identified papers that had cited the key papers. Complete details of the search strategy are provided elsewhere (3). We contacted methodologic experts and groups conducting work in this field, and we screened the reference lists of retrieved articles for additional studies.

Inclusion Criteria

All studies with the main objective of addressing bias or variation in the results of diagnostic accuracy studies were eligible for inclusion. Studies of any design, including reviews, and on any topic were eligible. Studies had to investigate the effects of bias or variation on measures of test performance, such as sensitivity, specificity, predictive values, likelihood ratios, and diagnostic odds ratios, and to indicate how a particular feature may distort these measures. Inclusion was assessed by one reviewer and checked by a second; discrepancies were resolved through discussion.

Data Extraction

One reviewer extracted, and a second reviewer checked, data on the following parameters: study design, objective, sources of bias or variation investigated, and the results for each source. Discrepancies were resolved by consensus or by consultation with a third reviewer.

Data Synthesis

We divided the different sources of bias and variation into groups (Table 1). Table 1 provides a brief description of each source of bias and variation; more detailed descriptions are available elsewhere (3). Results were stratified according to the source of bias or variation, and studies were grouped according to study design. We classified studies that used actual data from one or more clinical studies to demonstrate the effect of a particular study feature as experimental studies, diagnostic accuracy studies, or systematic reviews. Experimental studies were defined as studies specifically designed to test a hypothesis about the effect of a certain feature, for example, rereading sets of radiographs while controlling (manipulating) the overall prevalence of abnormalities. Studies that used models to simulate how certain types of bias may affect estimates of diagnostic test performance were classified as modeling studies; these were considered to provide theoretical evidence of bias or variation.

Role of the Funding Source

The funding source was not involved in the design, conduct, or reporting of the study or in the decision to submit the manuscript for publication.

Data Synthesis

The literature searches identified a total of 8663 references. Of these, 569 studies were considered potentially relevant and were assessed for inclusion; 55, published from 1963 to 2000, met the inclusion criteria. Nine studies were systematic reviews, 16 used an experimental design, 22 were diagnostic accuracy studies, and 8 used modeling to investigate the theoretical effects of bias or variation.
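As an illustration of what such a modeling study involves, the sketch below simulates a single continuous index test under two recruitment schemes: a clinical population of suspected patients and a case-control population of severe cases plus healthy volunteers. The threshold, Gaussian test-value distributions, and sample sizes are all hypothetical assumptions chosen only to demonstrate the spectrum effect described in the introduction.

# Illustrative modeling sketch: how recruiting severe cases and healthy
# volunteers, rather than consecutive suspected patients, can inflate both
# sensitivity and specificity. All distributions and thresholds are hypothetical.

import random

random.seed(1)
THRESHOLD = 1.0  # hypothetical cut-off above which the index test is called positive

def test_value(mean):
    # Continuous index test value; the mean rises with disease severity.
    return random.gauss(mean, 1.0)

def sens_spec(diseased_means, nondiseased_means):
    sens = sum(test_value(m) > THRESHOLD for m in diseased_means) / len(diseased_means)
    spec = sum(test_value(m) <= THRESHOLD for m in nondiseased_means) / len(nondiseased_means)
    return sens, spec

n = 10_000
# Clinical population: mild-to-moderate cases, symptomatic non-diseased patients.
clinical = sens_spec([1.5] * n, [0.5] * n)
# Case-control population: severe cases and healthy volunteers.
case_control = sens_spec([3.0] * n, [-1.0] * n)

print("suspected patients:   sens=%.2f spec=%.2f" % clinical)
print("cases vs. volunteers: sens=%.2f spec=%.2f" % case_control)

Because severe cases and healthy volunteers are easier to separate, the same test appears markedly more accurate in the case-control population (sensitivity and specificity near 0.98 versus roughly 0.69 in the suspected patients), even though the test itself is unchanged.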
Population

Demographic Features

Ten studies assessed the effects of demographic features on test performance (Table 2) (4, 5, 7, 9, 11, 14, 15, 20, 22, 24). Eight studies were diagnostic accuracy studies, and 2 were systematic reviews. All but one study (22) found an association between the features investigated and overall accuracy. The study that did not find an association investigated whether estimates of exercise testing performance differed between men and women; after correction for the effects of verification bias, no significant differences were found (22).

Table 2. Population

In general, the studies found associations between the demographic factors investigated and sensitivity; the reported effect on specificity was less strong. Four studies found that various factors, including sex, were associated with sensitivity but showed no association with specificity (4, 5, 11, 20). The index tests investigated in these studies were exercise testing to diagnose heart disease (5, 11, 20) and body mass index to test for obesity (4). Two additional studies of exercise testing also reported an association with sensitivity, but the effects on specificity differed: one found that factors that increased sensitivity also decreased specificity (14); the second reported higher sensitivity and specificity in men than in women (16). A study of the diagnostic accuracy of an alcohol screening questionnaire found that overall accuracy was increased in certain ethnic groups (24). Sex was the most commonly investigated variable: 3 studies found no association between test performance and sex, 9 found significant effects on sensitivity, and 4 found significant effects on specificity. Other variables shown to have significant effects on test performance were age, race, and smoking status.

Disease Severity

Six studies examined the effects of disease severity on test performance (Table 2) (5, 11, 14, 19, 23, 25). Three were diagnostic accuracy studies, 2 were reviews, and 1 used modeling to investigate the effects of differences in disease severity; the modeling study also included an example from a diagnostic accuracy study of tests for the diagnosis of ovarian cancer (25). Three studies investigated tests for heart disease (5, 11, 14), one examined ventilation-perfusion lung scans for diagnosing pulmonary embolism (23), and one investigated 2 different laboratory tests (one for cancer and the other for bacterial infections) (19). All 6 studies found increased sensitivity with more severe disease; 5 found no effect on specificity (5, 11, 14, 19, 23), and 1 did not comment on the effects on specificity (25).

Disease Prevalence

Six studies examined the effects of increased disease prevalence on test performance (Table 2) (8, 10, 13, 17, 21, 26). One study used an experimental design (8); the other studies were all diagnostic accuracy studies. The te
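A standard correction for verification (workup) bias is that of Begg and Greenes (the Biometrics paper listed as reference [7] below): if the decision to verify a patient with the reference standard depends only on the index test result, disease probabilities estimated among verified patients can be extrapolated to all tested patients. The sketch below illustrates the idea with purely hypothetical counts; whether any particular study above used exactly this method is not stated in the text.

# Sketch of a Begg-and-Greenes-style correction for partial verification
# (workup) bias, assuming verification depends only on the index test result.
# All counts are hypothetical.

def corrected_accuracy(n_pos, n_neg, v_tp, v_fp, v_fn, v_tn):
    # Disease probabilities among the verified subset, by index test result.
    p_disease_pos = v_tp / (v_tp + v_fp)   # P(disease | index test positive)
    p_disease_neg = v_fn / (v_fn + v_tn)   # P(disease | index test negative)
    # Extrapolate those probabilities to all tested patients to rebuild the full 2x2 table.
    tp, fp = n_pos * p_disease_pos, n_pos * (1 - p_disease_pos)
    fn, tn = n_neg * p_disease_neg, n_neg * (1 - p_disease_neg)
    return tp / (tp + fn), tn / (tn + fp)  # corrected sensitivity and specificity

# Hypothetical study: 90% of test-positive but only 10% of test-negative
# patients were referred for verification by the reference standard.
naive_sensitivity = 180 / (180 + 5)  # computed from verified patients only
corr_sens, corr_spec = corrected_accuracy(n_pos=300, n_neg=700,
                                          v_tp=180, v_fp=90, v_fn=5, v_tn=65)
print(f"naive sensitivity={naive_sensitivity:.2f}, "
      f"corrected sensitivity={corr_sens:.2f}, corrected specificity={corr_spec:.2f}")

In this example, the naive sensitivity of 0.97 falls to 0.80 after correction, illustrating how preferential verification of test-positive patients inflates apparent sensitivity.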
References

[1] H E Rockette, et al. Does knowledge of the clinical history affect the accuracy of chest radiograph interpretation? AJR. American Journal of Roentgenology, 1990.
[2] S. Cantor, et al. Ethnic and Sex Bias in Primary Care Screening Tests for Alcohol Use Disorders. Annals of Internal Medicine, 1998.
[3] N Segnan, et al. Inter-observer and intra-observer variability of mammogram interpretation: a field study. European Journal of Cancer, 1992.
[4] C. Dennis, et al. The Electrocardiographic Exercise Test in a Population with Reduced Workup Bias: Diagnostic Performance, Computerized Interpretation, and Multivariable Prediction. Annals of Internal Medicine, 1998.
[5] C. Pichard, et al. Body mass index compared to dual-energy x-ray absorptiometry: evidence for a spectrum bias. Journal of Clinical Epidemiology, 1997.
[6] J. W. Henry, et al. Stratification of patients according to prior cardiopulmonary disease and probability assessment based on the number of mismatched segmental equivalent perfusion defects. Approaches to strengthen the diagnostic value of ventilation/perfusion lung scans in acute pulmonary embolism. Chest, 1993.
[7] R A Greenes, et al. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics, 1983.
[8] P Deneef, et al. Evaluating Rapid Tests for Streptococcal Pharyngitis. Medical Decision Making, 1987.
[9] R. Pettigrew, et al. The importance of work-up (verification) bias correction in assessing the accuracy of SPECT thallium-201 testing for the diagnosis of coronary artery disease. Journal of Clinical Epidemiology, 1996.
[10] G. Arana, et al. The effect of diagnostic methodology on the sensitivity of the TRH stimulation test for depression: a literature review. Biological Psychiatry, 1990.
[11] J. Seward, et al. Sex and test verification bias. Impact on the diagnostic value of exercise echocardiography. Circulation, 1997.
[12] A R Feinstein, et al. The Limited Spectrum of Patients Studied in Exercise Test Research: Analyzing the Tip of the Iceberg. 1982.
[13] C Lenfant, et al. NHLBI funding policies. Enhancing stability, predictability, and cost control. Circulation, 1994.
[14] J. Soler-Soler, et al. Diagnostic accuracy of technetium-99m-MIBI myocardial SPECT in women and men. Journal of Nuclear Medicine, 1998.
[15] A R Feinstein, et al. The impact of clinical history on mammographic interpretations. JAMA, 1997.
[16] J. Elmore, et al. Variability in radiologists' interpretations of mammograms. The New England Journal of Medicine, 1994.
[17] D. Ransohoff, et al. Diagnostic Workup Bias in the Evaluation of a Test. Medical Decision Making, 1982.
[18] V M Haughton, et al. The effect of clinical bias on the interpretation of myelography and spinal computed tomography. Radiology, 1982.
[19] P. Bossuyt, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA, 1999.
[20] A. Verbeek, et al. Problems in selecting the adequate patient population from existing data files for assessment studies of new diagnostic tests. Journal of Clinical Epidemiology, 1995.
[21] R. Detrano, et al. Methodologic problems in exercise testing research. Are we solving them? Archives of Internal Medicine, 1988.
[22] C E Phelps, et al. Estimating Diagnostic Test Accuracy Using a "Fuzzy Gold Standard". Medical Decision Making, 1995.
[23] D. Rennie, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Annals of Internal Medicine, 2003.
[24] R J Panzer, et al. Workup Bias in Prediction Research. Medical Decision Making, 1987.
[25] V. Hachinski, et al. Fallacies in the pathological confirmation of the diagnosis of Alzheimer's disease. Journal of Neurology, Neurosurgery, and Psychiatry, 1998.
[26] X H Zhou, et al. Effect of verification bias on positive and negative predictive values. Statistics in Medicine, 1994.
[27] A. Taube, et al. Over- and underestimation of the sensitivity of a diagnostic malignancy test due to various selections of the study population. Acta Oncologica, 1990.
[28] M. Schreiber, et al. The clinical history as a factor in roentgenogram interpretation. JAMA, 1963.
[29] A. Feinstein, et al. Clinical Epidemiology: The Architecture of Clinical Research. 1987.
[30] D Mulvihill, et al. Exercise-induced ST segment depression in the diagnosis of multivessel coronary disease: a meta-analysis. Journal of the American College of Cardiology, 1989.
[31] A. Feinstein, et al. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. The New England Journal of Medicine, 1978.
[32] G. Ronco, et al. Estimating the sensitivity of cervical cytology: errors of interpretation and test limitations. Cytopathology, 1996.
[33] A. Smit, et al. ROC analysis of noninvasive tests for peripheral arterial disease. Ultrasound in Medicine & Biology, 1996.
[34] K S Berbaum, et al. Impact of clinical history on fracture detection with radiography. Radiology, 1988.
[35] R. Detrano, et al. Factors affecting sensitivity and specificity of a diagnostic test: the exercise thallium scintigram. The American Journal of Medicine, 1988.
[36] K S Berbaum, et al. Impact of clinical history on radiographic detection of fractures: a comparison of radiologists and orthopedists. AJR. American Journal of Roentgenology, 1989.
[37] E. Verdonschot, et al. Factors involved in validity measurements of diagnostic tests for approximal caries: a meta-analysis. Caries Research, 1995.
[38] P Doubilet, et al. Interpretation of radiographs: effect of clinical history. AJR. American Journal of Roentgenology, 1981.
[39] H. Melbye, et al. The spectrum of patients strongly influences the usefulness of diagnostic tests for pneumonia. Scandinavian Journal of Primary Health Care, 1993.
[40] The impact of adjusting for post-test referral bias on apparent sensitivity and specificity of SPECT myocardial perfusion imaging in men and women. 1998.
[41] M. Cohen, et al. Pathology and probability. Likelihood ratios and receiver operating characteristic curves in the interpretation of bronchial brush specimens. American Journal of Clinical Pathology, 1995.
[42] P M Bossuyt, et al. Effect of study design on the association between nuchal translucency measurement and Down syndrome. Obstetrics and Gynecology, 1999.
[43] F. Harrell, et al. Factors affecting sensitivity and specificity of exercise electrocardiography. Multivariable analysis. The American Journal of Medicine, 1984.
[44] A. Feinstein, et al. Spectrum Bias in the Evaluation of Diagnostic Tests: Lessons from the Rapid Dipstick Test for Urinary Tract Infection. Annals of Internal Medicine, 1992.
[45] G. Diamond, et al. Comparison of the sensitivity and specificity of exercise electrocardiography in biased and unbiased populations of men and women. American Heart Journal, 1995.
[46] L. A. Thibodeau. Evaluating Diagnostic Tests. 1981.
[47] P. Bossuyt, et al. Development and validation of methods for assessing the quality of diagnostic accuracy studies. Health Technology Assessment, 2004.
[48] B. Ljung, et al. Influence of training and experience in fine-needle aspiration biopsy of breast. Receiver operating characteristics curve analysis. Archives of Pathology & Laboratory Medicine, 1987.
[49] R. Detrano, et al. The diagnostic accuracy of the exercise electrocardiogram: a meta-analysis of 22 years of research. Progress in Cardiovascular Diseases, 1989.
[50] Diederick E. Grobbee, et al. Limitations of Sensitivity, Specificity, Likelihood Ratio, and Bayes' Theorem in Assessing Diagnostic Probabilities: A Clinical Example. Epidemiology, 1997.
[51] A R Feinstein, et al. Context bias. A problem in diagnostic radiology. JAMA, 1996.
[52] D. Berman, et al. The declining specificity of exercise radionuclide ventriculography. The New England Journal of Medicine, 1983.
[53] A. Cuarón, et al. Interobserver variability in the interpretation of myocardial images with Tc-99m-labeled diphosphonate and pyrophosphate. Journal of Nuclear Medicine, 1980.
[54] A. Detsky, et al. The effect of spectrum bias on the utility of magnetic resonance imaging and evoked potentials in the diagnosis of suspected multiple sclerosis. Neurology, 1996.
[55] S S Raab, et al. Effect of clinical history on diagnostic accuracy in the cytologic interpretation of bronchial brush specimens. American Journal of Clinical Pathology, 2000.
[56] J. C. Christiansen, et al. Determinants of sensitivity and specificity of electrocardiographic criteria for left ventricular hypertrophy. Circulation, 1990.
[57] D. Dail, et al. Reproducibility of the histologic diagnosis of pneumonia among a panel of four pathologists: analysis of a gold standard. Chest, 1997.