Access to large amounts of de-identified clinical data would facilitate epidemiologic and retrospective research. Previously described de-identification methods either require knowledge of natural language processing or have not been made available to the public. We take advantage of the fact that the vast majority of proper names in pathology reports occur in pairs. In the rare cases where a proper name appears by itself, it is preceded or followed by an affix that identifies it as a proper name (Mrs., Dr., PhD). Based on this observation, we created a substitution-based tool that is easy to implement and relies largely on publicly available data sources. We compiled a Clinical and Common Usage Word (CCUW) list as well as a fairly comprehensive proper name list. Despite the large overlap between these two lists, we were able to refine our methods to achieve accuracy similar to that of previous attempts at de-identification. Our method found 98.7% of 231 proper names in the narrative sections of pathology reports. Three single proper names were missed in 1001 pathology reports (0.3%); no first name/last name pairs were missed. It is unlikely that an individual's identity could be inferred from this information. We will continue to refine our methods, specifically by improving the quality of our CCUW and proper name lists, to obtain higher levels of accuracy.
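The pairing and affix observation described above can be illustrated with a minimal sketch. It is not the authors' tool: the word lists, the `[NAME]` placeholder, and the function name `deidentify` are hypothetical stand-ins, the CCUW list is not modeled, and the rule applied (replace a listed name only when it is adjacent to another listed name or to a title or degree) is an assumed simplification of the substitution method.

```python
import re

# Hypothetical, tiny stand-ins for the much larger lists the abstract describes.
PROPER_NAMES = {"john", "smith", "mary", "brown", "white", "jones"}
PREFIXES = {"mr", "mrs", "ms", "dr"}   # titles that precede a name
SUFFIXES = {"md", "phd"}               # degrees that follow a name

# Split text into alternating runs of letters and non-letters so that
# spacing and punctuation can be reproduced exactly on output.
TOKEN_RE = re.compile(r"[A-Za-z]+|[^A-Za-z]+")


def deidentify(text, placeholder="[NAME]"):
    """Replace listed names that occur in pairs or next to an affix."""
    tokens = TOKEN_RE.findall(text)
    # Positions of the word tokens (letter runs) within `tokens`.
    word_idx = [i for i, t in enumerate(tokens) if t[0].isalpha()]
    words = [tokens[i].lower() for i in word_idx]
    is_name = [w in PROPER_NAMES for w in words]

    redact = set()
    for k in range(len(words)):
        if not is_name[k]:
            continue
        prev_w = words[k - 1] if k > 0 else ""
        next_w = words[k + 1] if k + 1 < len(words) else ""
        # Assumed rule 1: a listed name next to another listed name is a pair.
        paired = (k > 0 and is_name[k - 1]) or (
            k + 1 < len(words) and is_name[k + 1]
        )
        # Assumed rule 2: a single listed name flagged by a title or degree.
        affixed = prev_w in PREFIXES or next_w in SUFFIXES
        if paired or affixed:
            redact.add(word_idx[k])

    return "".join(placeholder if i in redact else t for i, t in enumerate(tokens))


if __name__ == "__main__":
    print(deidentify("Specimen received from Dr. Smith for John Brown."))
    print(deidentify("Brown and white tissue fragments."))
```

In this sketch, words such as "brown" and "white" that double as clinical descriptors are left untouched when they stand alone without an affix, which is one way the overlap between a name list and a common-usage list might be handled. The sketch ignores sentence boundaries, so two listed words that happen to be adjacent would still be treated as a pair.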