Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research.

We evaluated a comprehensive deidentification engine at the University of Pittsburgh Medical Center (UPMC), Pittsburgh, PA, that uses a complex set of rules, dictionaries, pattern-matching algorithms, and the Unified Medical Language System to identify and replace identifying text in clinical reports while preserving medical information for sharing in research. In our initial data set of 967 surgical pathology reports, the software did not suppress outside (103), UPMC (47), and non-UPMC (56) accession numbers; dates (7); names (9) or initials (25) of case pathologists; or hospital or laboratory names (46). In 150 reports, some clinical information was suppressed inadvertently (overmarking). The engine retained eponymic patient names, eg, Barrett and Gleason. In the second evaluation (1,000 reports), the software did not suppress outside (90) or UPMC (6) accession numbers or names (4) or initials (2) of case pathologists. In the third evaluation, the software removed names of patients, hospitals (297/300), pathologists (297/300), transcriptionists, residents and physicians, dates of procedures, and accession numbers (298/300). By the end of the evaluation, the system was reliably and specifically removing safe-harbor identifiers and producing highly readable deidentified text without removing important clinical information. Collaboration between pathology domain experts and system developers and continuous quality assurance are needed to optimize ongoing deidentification processes.
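The abstract describes the engine only in general terms (rules, dictionaries, pattern matching), so the following is a minimal sketch of that general approach and not the evaluated system. The accession-number format, the name dictionary, the eponym allow-list, and the function name `deidentify` are all hypothetical and chosen only for illustration. Unlike the evaluated engine, the sketch does not use the Unified Medical Language System.

```python
import re

# Hypothetical allow-list of eponyms that should survive scrubbing
# (the abstract names Barrett and Gleason as examples).
EPONYMS = {"Barrett", "Gleason"}

# Hypothetical dictionary of person names; a real engine would use far larger lists.
NAME_DICTIONARY = {"Smith", "Jones", "Barrett"}

# Assumed accession-number shape such as "S03-12345"; real formats vary by institution.
ACCESSION_RE = re.compile(r"\b[A-Z]{1,3}\d{2}-\d{3,6}\b")
# Simple numeric dates such as 03/14/2003.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
# Capitalized words, which are candidates for the dictionary lookup.
WORD_RE = re.compile(r"\b[A-Z][a-z]+\b")


def deidentify(text):
    """Replace accession numbers, dates, and dictionary names with placeholders."""
    text = ACCESSION_RE.sub("[ACCESSION]", text)
    text = DATE_RE.sub("[DATE]", text)

    def replace_name(match):
        word = match.group(0)
        # Keep eponyms even when they also appear in the name dictionary.
        if word in NAME_DICTIONARY and word not in EPONYMS:
            return "[NAME]"
        return word

    return WORD_RE.sub(replace_name, text)


report = ("Case S03-12345 signed by Dr. Smith on 03/14/2003: "
          "Barrett esophagus, Gleason score 7.")
print(deidentify(report))
```

An allow-list like this keeps every occurrence of an eponym, including a patient who happens to be named Barrett, and accession numbers in other formats pass through unmarked. Undermarking and overmarking of this kind are what the successive evaluation rounds in the abstract were counting.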
