A system for de-identifying medical message board text

Users seeking support and information on a wide range of medical conditions have written millions of public posts on medical message boards. Prior work has shown that these posts can be used to gain a greater understanding of patients’ experiences and concerns. As investigators continue to explore large corpora of medical discussion board data for research purposes, protecting the privacy of the members of these online communities becomes an important challenge. Existing entity recognition methods designed for more structured text are not sufficient, because message board posts present additional challenges: they contain many typographical errors, a larger variety of possible names, terms and abbreviations specific to Internet posts or to a particular message board, and mentions of the authors’ personal lives. The main contribution of this paper is a system that automatically de-identifies the authors of message board posts while taking these challenges into account. We demonstrate our system on two message board corpora, one on breast cancer and another on arthritis. We show that our approach significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text such as operative reports, pathology reports, discharge summaries, or newswire.

[17]  Lyle H. Ungar, et al. A system for de-identifying medical message board text, 2010, 2010 Ninth International Conference on Machine Learning and Applications.
