Location Bias of Identifiers in Clinical Narratives

Scrubbing identifying information from narrative clinical documents is a critical first step in preparing the data for secondary uses, such as translational research. Evidence suggests that the differential distribution of protected health information (PHI) in clinical documents could supply additional features that improve the performance of automated de-identification algorithms and toolkits. However, there has been little investigation into the extent to which this phenomenon occurs in practice. To assess the issue empirically, we identified the location of PHI in 140,000 clinical notes from an electronic health record system and characterized its distribution as a function of location within a document. In addition, we calculated the 'word proximity' of nearby PHI elements to determine their co-occurrence rates. The PHI elements were found to have non-random distribution patterns. Location within a document and proximity between PHI elements might therefore be used to help de-identification systems label PHI more accurately.
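
The two measurements described above can be illustrated with a minimal sketch. It assumes that each note is already tokenized and that PHI annotations are available as `(start, end, label)` token spans with an exclusive end index. The function names, the ten-bin split, and the definition of proximity as the token gap between consecutive PHI elements are illustrative assumptions, not the procedure used in the study.

```python
from collections import Counter


def location_histogram(n_tokens, phi_spans, n_bins=10):
    """Count PHI elements per relative-position bin of one document.

    phi_spans: iterable of (start, end, label) token spans, end exclusive
    (a hypothetical annotation format chosen for this sketch).
    """
    counts = Counter()
    for start, _end, _label in phi_spans:
        # Integer arithmetic avoids floating-point edge cases at bin borders.
        bin_index = min(start * n_bins // n_tokens, n_bins - 1)
        counts[bin_index] += 1
    return [counts[b] for b in range(n_bins)]


def word_proximity(phi_spans):
    """Token gap between each pair of consecutive PHI elements.

    Returns a list of ((label_a, label_b), gap) tuples, where gap is the
    number of non-PHI tokens separating the two elements.
    """
    ordered = sorted(phi_spans)
    gaps = []
    for (_s1, e1, l1), (s2, _e2, l2) in zip(ordered, ordered[1:]):
        gaps.append(((l1, l2), max(s2 - e1, 0)))
    return gaps


# Toy 20-token note with three annotated PHI elements.
spans = [(0, 2, "NAME"), (3, 4, "DATE"), (18, 19, "NAME")]
histogram = location_histogram(20, spans)
proximities = word_proximity(spans)
```

Summing such histograms over a corpus would show whether PHI clusters near the start or end of notes, and the label-pair gaps could be aggregated into co-occurrence rates at a chosen distance threshold.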
