The pattern of name tokens in narrative clinical text and a comparison of five systems for redacting them

Objective To understand the factors that influence success in scrubbing personal names from narrative text.

Materials and methods We developed a scrubber, the NLM Name Scrubber (NLM-NS), to redact personal names from narrative clinical reports; hand-tagged words in a set of gold standard narrative reports as personal names or not; and measured the scrubbing success of NLM-NS and of four other scrubbing/name recognition tools (MIST, MITdeid, LingPipe, and ANNIE/GATE) against the gold standard reports. We ran three comparisons that used increasingly large name lists.

Results The test reports contained more than 1 million words, of which 2388 were patient and 20 160 were provider name tokens. NLM-NS failed to scrub only 2 of the 2388 instances of patient name tokens. Its sensitivity was 0.999 on both patient and provider name tokens, and it missed fewer instances of patient name tokens than the other scrubbers in all comparisons. MIST produced the best all-token specificity and F-measure for name instances in our most relevant study (study 2), with values of 0.997 and 0.938, respectively. In that same comparison, NLM-NS was second best, with values of 0.986 and 0.748, respectively, and MITdeid was a close third, with values of 0.985 and 0.796, respectively. With the addition of the Clinical Center name list to their native name lists, LingPipe, MITdeid, MIST, and ANNIE/GATE all improved substantially. MITdeid and LingPipe gained the most, reaching patient name sensitivity of 0.995 (F-measure=0.705) and 0.989 (F-measure=0.386), respectively.

Discussion The privacy risk due to the two name tokens missed by NLM-NS was statistically negligible, since neither individual could be distinguished among more than 150 000 people listed in the US Social Security Registry.

Conclusions The nature and size of name lists have a substantial influence on scrubbing success. The use of very large name lists with frequency statistics accounts for much of the scrubbing success of NLM-NS.
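The token-level measures reported above (sensitivity, specificity, and F-measure) can be illustrated with a minimal Python sketch. It assumes the standard confusion-matrix definitions, which the abstract does not spell out; the function names are hypothetical, and the only figures taken from the text are the 2 missed tokens out of 2388 patient name tokens.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true name tokens that were scrubbed (recall)."""
    total = tp + fn
    return tp / total if total else 0.0


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-name tokens that were correctly left in place."""
    total = tn + fp
    return tn / total if total else 0.0


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (balanced F-measure, assumed)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = sensitivity(tp, fn)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# From the abstract: 2 of 2388 patient name tokens were missed.
patient_tokens = 2388
missed = 2
patient_sensitivity = sensitivity(patient_tokens - missed, missed)
print(round(patient_sensitivity, 3))
```

Under these definitions, a scrubber that flags many non-name words can keep a high sensitivity while its F-measure falls, because false positives lower precision but leave recall unchanged.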
