The tuberculin skin test has many potential sources of error and variability. Standardization of the tuberculin reagent and the meaning of the test results have been considered in some detail [1, 2], but little attention has been paid to the reading itself [3-10]. Measurement of the induration, however, is one of the most important potential sources of error. If the customary technique of palpation is used, the margins of the induration may be difficult to define. The alternative ballpoint-pen method, although advocated as more reliable than palpation [3], has not been discussed in official statements on tuberculosis [1, 2]. We investigated the reliability of the ballpoint-pen technique and compared this technique with the palpation method.

Methods

Patients and Procedures

Patients and health care personnel who were in an internal medicine department and needed a tuberculin skin test were invited to participate. Persons who had received bacille Calmette-Guérin vaccine were enrolled preferentially. Ninety-six persons who provided informed consent ultimately participated in the study.

Tuberculin Skin Tests and Measurement Methods

Ten units of tuberculin from Pasteur Mérieux, Lyon, France (corresponding to the recommended 5 IU of purified protein derivative tuberculin), were injected intradermally on the volar surface of the forearm (Mantoux technique) [11]. Readings were done on the third day after the test was administered, and the diameter of induration was measured along the long axis of the forearm.

Two experienced investigators each independently did three measurements. The first two measurements were taken with a blinded caliper using the ballpoint-pen technique [3]. With this technique, a medium-point ballpoint pen is used to draw a line starting 1 to 2 cm away from the skin reaction and moving toward its center. When the pen reaches the margin of the induration, an increased resistance to further movement is felt and the pen is lifted.
The procedure is repeated on the opposite side of the skin reaction. The distance between the ends of the opposing lines at the margins of the induration is measured. In our study, the lines were erased and the measurement process was repeated. The lines were then erased again, and the third measurement was done by palpation [2]. To reproduce the usual conditions of testing, we used a flexible ruler.

The data were collected during eight sessions; 11 to 14 participants were tested per session. Three precautions were taken to reduce the chance that an observer would remember previous readings. First, the results of measures that were obtained with the blinded caliper were recorded by a third investigator. Second, the first ballpoint-pen measure was done for all participants at each session, then the second ballpoint-pen measure, and then the palpation measure. Third, before the second and third readings, the third investigator verified that no minor landmarks persisted.

Statistical Analysis

To analyze the reliability of quantitative data, we used statistical methods that have been described elsewhere [12]. Intraclass correlation coefficients and their 95% CIs were computed using SAS software (SAS Institute, Cary, North Carolina) [13]. Induration diameters were used to classify skin reactions as positive or negative according to the 5-, 10-, and 15-mm cutoff points that have been recommended as indicating positivity in various situations [2]. Reliability was then assessed with κ coefficients [14].

We also used a graphical analysis that focuses on the mean and the variation in the differences between repeated measurements [15]. Mean differences and the SD of the differences were calculated. An area of imprecision that was determined on the basis of the SD of the differences was placed around the arbitrarily chosen 10-mm cutoff value (10 mm ± 1.96 SD).
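
The graphical analysis of differences between repeated measurements reduces to a few arithmetic steps, sketched here in Python as an illustration only. The readings are hypothetical values, not study data; the function names, the sign convention (second reading minus first), and the rule that a reading at or above the cutoff counts as positive are assumptions made for this sketch.

```python
from statistics import mean, stdev

def limits_of_agreement(first, second, z=1.96):
    """Return (mean difference, SD of differences, lower limit, upper limit)."""
    # Assumed sign convention: second reading minus first reading.
    diffs = [b - a for a, b in zip(first, second)]
    d_mean = mean(diffs)
    d_sd = stdev(diffs)  # sample SD (n - 1 denominator)
    return d_mean, d_sd, d_mean - z * d_sd, d_mean + z * d_sd

def area_of_imprecision(cutoff, d_sd, z=1.96):
    """Interval of cutoff +/- z * SD of the differences."""
    return cutoff - z * d_sd, cutoff + z * d_sd

def reclassified(first, second, cutoff):
    """Indices of pairs whose positive/negative status differs between readings."""
    # Assumption: a reading at or above the cutoff is classified as positive.
    return [
        i for i, (a, b) in enumerate(zip(first, second))
        if (a >= cutoff) != (b >= cutoff)
    ]

# Hypothetical paired induration diameters in mm (not data from the study).
first_reading = [0.0, 6.5, 9.0, 10.5, 12.0, 15.0, 18.5, 22.0]
second_reading = [0.0, 7.0, 10.5, 9.5, 12.5, 14.0, 19.0, 21.0]

d_mean, d_sd, lower, upper = limits_of_agreement(first_reading, second_reading)
area_low, area_high = area_of_imprecision(10.0, d_sd)
changed = reclassified(first_reading, second_reading, 10.0)

print(f"mean difference: {d_mean:.2f} mm, SD: {d_sd:.2f} mm")
print(f"95% limits of the differences: {lower:.2f} to {upper:.2f} mm")
print(f"area of imprecision around 10 mm: {area_low:.2f} to {area_high:.2f} mm")
print(f"reclassified pairs (indices): {changed}")
```

Comparing the indices returned by `reclassified` with the first readings that fall between `area_low` and `area_high` gives the kind of tally reported for the 10-mm cutoff.
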
If a first measurement fell within this area, particularly at or about the cutoff value, the likelihood that the second measurement would be sufficiently different to change the result of the tuberculin skin test from negative to positive (or vice versa) was high. Conversely, such reclassification would occur in only 5% of the cases that had values outside this area.

Results

Because of the study design, only 27 participants (28%) did not react to the tuberculin skin test.

Reliability of the Ballpoint-Pen Technique

Intraobserver Reliability

In persons who had no response to the tuberculin skin test, the intraobserver reliability was perfect (intraclass correlation coefficient = 1.0). Intraclass correlation coefficients were high for both observers and decreased only slightly after the nonresponders were excluded. The κ coefficients also suggested good intraobserver reliability but were lower with the 10- and 15-mm cutoff values than with the 5-mm cutoff value (Table 1).

Table 1. Reliability Study of the Ballpoint-Pen and Palpation Methods of Induration Measurement for the Tuberculin Skin Test

The top panel of Figure 1 shows, for each participant, the difference between the two readings done by the first observer (range, −6.8 to +3.5 mm) plotted against the mean of those two readings. The level of intraobserver reliability was evaluated by determining the 95% CI (−2.68 to +2.96 mm) within which most of the differences were seen. This means that 5% of the time, the second measure of the test results done by using the ballpoint-pen method would be at least 2.7 mm less than or 3.0 mm more than the first one. This lack of reliability could lead to the reclassification of a negative tuberculin skin test result as positive or vice versa.

Figure 1. Top, middle, and bottom panels.

As shown in the top panel of Figure 1, an area of imprecision that straddles the cutoff value (7.2 to 12.8 mm for a 10-mm cutoff value) was generated using the SD of the differences.
Test results for 8 of the 69 patients (12%) were reclassified. The first measurement for 30 of the 69 patients (43.5%) fell within this area of imprecision; 7 of those 30 patients (23.3%) were among the 8 patients whose test results were reclassified.

Interobserver Reliability

Agreement between observers, estimated by using the intraclass correlation and κ coefficients, was high (Table 1). The first ballpoint-pen measures made by the two observers were used for these analyses. Differences between first measures done by the two observers were between −5.1 and +7.3 mm (Figure 1, middle). The 95% CI of the differences was −3.39 to +3.69 mm; this means that 5% of the time, the result of a second tuberculin skin test measurement by another investigator would be at least 3.4 mm more than or 3.7 mm less than that of a first investigator. As in the top panel of Figure 1, an area of imprecision (6.5 to 13.5 mm) is shown in the middle panel of Figure 1; this area is slightly broader than that calculated for intraobserver reliability. Test results for 8 of the 69 patients (12%) were reclassified. The first measurement for 40 of the 69 patients (58%) fell within this area of imprecision; 7 of those 40 patients (17.5%) were among the 8 patients whose results were reclassified.

Reliability of the Palpation Technique

Except for the κ coefficients at the 15-mm cutoff, assessment of agreement between observers showed that all reliability coefficients obtained with the palpation technique were slightly lower than those obtained with the ballpoint-pen method (Table 1). The 95% CI of the differences between the measures of the two observers was −4.6 to +5.2 mm (Figure 1, bottom). This resulted in a much broader area of imprecision for the readings (5.1 to 14.9 mm). Test results were reclassified for 12 of the 69 patients (17.4%).
The first measure of 43 of the 69 patients (62.3%) fell within this area of imprecision, and the 12 patients whose test results were reclassified were among those 43 (27.9%).

Agreement between Ballpoint-Pen and Palpation Methods

Although all the intraclass correlation coefficients were high, the κ coefficients that were produced after persons with no response to the test were excluded suggested only moderate to good reliability (Table 1). The 95% CIs of the differences between the first ballpoint-pen and the palpation measures were −3.0 to +4.1 mm for readings taken by the first observer and −2.5 to +3.9 mm for readings taken by the second observer. The areas of imprecision for the measurements were from 6.4 to 13.6 mm for readings taken by the first observer and 6.8 to 13.2 mm for readings taken by the second observer. Reclassification occurred in 8 of 69 patients (12%) for both observers.

Discussion

In our study, the ballpoint-pen technique was reliable, as evaluated by global reliability coefficients. However, the graphical analysis provided a more meaningful representation of the level of variation. Intraobserver reliability may be the most important factor for such diagnostic tests as the tuberculin skin test, which are usually done by only one examiner for any given patient. Lack of reliability may lead to the frequent reclassification of results, particularly if readings are at or about the cutoff values.

Reliability coefficients were slightly higher for the ballpoint-pen technique than for the palpation method. In addition, the 95% CI of the differences of the measures taken by the two observers was 38% broader for the palpation method (9.8 mm wide) than for the ballpoint-pen technique (7.1 mm wide); this could result in more frequent misclassification.

Only one study [10] has addressed the interobserver reliability of the ballpoint-pen technique. That study relied on simple correlation coefficients to determine reliability.
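
Agreement on a dichotomized result (positive or negative at a given cutoff) is commonly summarized with Cohen's κ, which corrects the observed agreement for the agreement expected by chance. The following Python sketch shows the unweighted two-rater form as an illustration only. The readings are hypothetical, the function name is ours, and a reading at or above the cutoff is assumed to count as positive; it is not presented as the exact computation used in any of the studies discussed.

```python
def cohen_kappa_at_cutoff(first, second, cutoff):
    """Unweighted Cohen's kappa for two sets of readings dichotomized at a cutoff."""
    # Assumption: a reading at or above the cutoff is classified as positive.
    a = [x >= cutoff for x in first]
    b = [x >= cutoff for x in second]
    n = len(a)

    # Proportion of pairs on which the two classifications agree.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Agreement expected by chance, from each rater's marginal positive rate.
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)

    if expected == 1.0:
        # Both raters put every reading in the same single class.
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)

# Hypothetical paired induration diameters in mm (not data from any study).
observer_1 = [0.0, 6.5, 9.0, 10.5, 12.0, 15.0, 18.5, 22.0]
observer_2 = [0.0, 7.0, 10.5, 9.5, 12.5, 14.0, 19.0, 21.0]

for cutoff_mm in (5.0, 10.0, 15.0):
    kappa = cohen_kappa_at_cutoff(observer_1, observer_2, cutoff_mm)
    print(f"cutoff {cutoff_mm:.0f} mm: kappa = {kappa:.2f}")
```

Because κ depends on where the cutoff falls relative to the readings, the same pair of observers can show different κ values at the 5-, 10-, and 15-mm cutoffs even though the underlying measurements are unchanged.
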
Reanalysis of the data from that study provided a κ coefficient of 0.74 (using a cutoff point of 10 mm).

Previous studies of the reliability of the palpation method [5, 6, 10] have also been restricted primarily to the assessment of interobserver agreement and have provided conflicting results. Recalculation from the data of one large survey of six studies on tuberculin skin testing [4] gave
[1] D. Snider, et al. Treatment of tuberculosis and tuberculosis infection in adults and children. American Thoracic Society and The Centers for Disease Control and Prevention. American Journal of Respiratory and Critical Care Medicine, 1994.
[2] R. Huebner, et al. The tuberculin skin test. Clinical Infectious Diseases, 1993.
[3] Jacob Cohen. A coefficient of agreement for nominal scales. 1960.
[4] J. Porter, et al. Preventive therapy for tuberculosis in HIV-infected persons: international recommendations, research, and practice. The Lancet, 1995.
[5] N. Siafakas, et al. Palpation vs pen method for the measurement of skin tuberculin reaction (Mantoux test). Chest, 1991.
[6] D. T. Carr. The tuberculin skin test. The American Review of Respiratory Disease, 1972.
[7] Guidelines on the management of tuberculosis and HIV infection in the United Kingdom. Subcommittee of the Joint Tuberculosis Committee of the British Thoracic Society. BMJ, 1992.
[8] J. Bearman, et al. A study of variability in tuberculin test reading. The American Review of Respiratory Disease, 2015.
[9] Lloyd N. Friedman, et al. Diagnostic standards and classification of tuberculosis. The American Review of Respiratory Disease, 1991.
[10] N. Siafakas, et al. The role of inexperience in measuring tuberculin skin reaction (Mantoux test) by the pen or palpation technique. Respiratory Medicine, 1992.
[11] P. Edwards, et al. Experimental error in the determination of tuberculin sensitivity. Public Health Reports, 1951.
[12] D. Altman, et al. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 1986.
[13] D. Solomon, et al. Reading the tuberculin skin test. Who, when, and how? Archives of Internal Medicine, 1988.
[14] J. Fleiss, et al. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 1979.
[15] J. Sokal. Measurement of delayed skin-test responses. 1975.
[16] T. Jordan, et al. Tuberculin reaction size measurement by the pen method compared to traditional palpation. Chest, 1987.