Single vs. double data entry.

To the Editor: Epidemiologic research has to struggle with the necessity of extensive data collection despite limited financial resources. High numbers of participants and detailed questionnaires, which are often not designed for automatic data capture, are typical. Double entry of data, in combination with subsequent or simultaneous data comparison and creation of a final dataset, is state-of-the-art in clinical trials and has been recommended for epidemiologic studies. However, double data entry substantially increases costs compared with single data entry. Our Medline search could not identify reports assessing the data quality achieved by single versus double data entry in epidemiologic studies under real conditions. Therefore, we investigated the amount and sources of error occurring during single data entry and the potential improvement by double data entry, within the context of an ongoing multicenter environmental cohort study.

We compared 2 databases resulting from single data entry with the reference dataset (created by double entry, followed by comparison and correction of errors), using all records for August through October 2003. We defined a data entry error as a deviation between a single entry database and the reference database in a character or digit of any database field. We took into account the type of questionnaire (interview, self-administered) and type of variable (closed questions: dichotomous, categorical; open questions: continuous, text field). To calculate the error rate, we divided the number of observed deviations by the total number of database fields. In addition, we investigated the reasons for errors in a random sample of about 10% of all observed discrepancies.

Overall, the observed error rates varied between 0.54% and 0.72% (Table). The error rates for open plain text fields were higher (ranging from 1.2% to 2.1%) than those for open continuous variables (0.2%–0.8%) or closed dichotomous variables (0.4%–0.5%).
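The error rate calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function name `field_error_rates`, the record identifiers, the field names, and the toy values are all hypothetical. It counts a field as erroneous when the single entry differs from the reference in any character, and divides the number of deviating fields by the total number of fields, overall and per variable type.

```python
from collections import defaultdict


def field_error_rates(single, reference, field_types):
    """Compare a single-entry dataset with the reference dataset field by field.

    Returns the overall error rate and the error rate per variable type.
    All names and data structures here are illustrative assumptions.
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for record_id, ref_fields in reference.items():
        entered = single.get(record_id, {})
        for field, ref_value in ref_fields.items():
            kind = field_types[field]
            totals[kind] += 1
            # Any differing character or digit counts as one field-level error.
            if str(entered.get(field, "")) != str(ref_value):
                errors[kind] += 1
    overall = sum(errors.values()) / sum(totals.values())
    by_type = {kind: errors[kind] / totals[kind] for kind in totals}
    return overall, by_type


# Hypothetical toy data: two records with three fields each.
reference = {
    "P001": {"smoker": "1", "height_cm": "172", "occupation": "baker"},
    "P002": {"smoker": "0", "height_cm": "165", "occupation": "teacher"},
}
single = {
    "P001": {"smoker": "1", "height_cm": "172", "occupation": "barker"},  # typo
    "P002": {"smoker": "0", "height_cm": "165", "occupation": "teacher"},
}
field_types = {
    "smoker": "dichotomous",
    "height_cm": "continuous",
    "occupation": "text",
}

overall, by_type = field_error_rates(single, reference, field_types)
print(f"overall error rate: {overall:.2%}")
for kind, rate in by_type.items():
    print(f"{kind}: {rate:.2%}")
```

In this toy example, one of six fields deviates from the reference, so the overall rate is about 16.7%; real datasets of the size described would of course yield far smaller rates.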
Most of the errors (72%) originated from interpretation problems on the part of the data entry staff. These arose mostly from additional handwritten comments by fieldworkers or study participants, or from incorrect questionnaire completion, such as multiple ticks in questions where only one tick was allowed. Classic mistakes, such as shifting in the input line or mistyping, accounted for only 28% of the observed errors.

Quality assurance measures, such as training of fieldworkers (eg, standardized interview performance) and data entry staff (eg, specifications on how to handle frequent problems), as well as monitoring of the completed questionnaires, contribute to lower error rates in data entry. In our study with adequately trained and experienced staff, the overall error rate of single data entry was only slightly higher than 0.5%. Under these conditions, double data entry would only marginally enhance data quality, but would increase the time and costs of data entry substantially. The additional effort includes not only the doubled time for entering the data, but also the time necessary for programming the comparisons of databases, working through the documentation to explore deviations, and performing the corrections. Expressed in monetary terms, double data entry increases cost by a factor of about 2.5 in comparison with single data entry. Although software programs can integrate the comparison of data in the second data entry, these software solutions are costly and time-consuming.

In conclusion, in times of increasingly limited financial resources, it may be worthwhile to consider single data entry with concomitant quality control as an option to enhance cost-effective
