Strategies for De-identification and Anonymization of Electronic Health Record Data for Use in Multicenter Research Studies

Background: De-identification and anonymization are strategies used to remove patient identifiers from electronic health record data. These strategies are of paramount importance in multicenter research studies, which must share electronic health record data across multiple environments and institutions while safeguarding patient privacy.

Methods: A systematic literature search was conducted using the keywords de-identify, deidentify, de-identification, deidentification, anonymize, anonymization, data scrubbing, and text scrubbing. The search covered 6 common literature databases and included publications up to June 30, 2011. A total of 1798 prospective citations were identified; 94 met the criteria for review, and the corresponding full-text articles were obtained. Search results were supplemented by review of 26 additional full-text articles, for a total of 120 full-text articles reviewed.

Results: A final sample of 45 articles met inclusion criteria for review and discussion. Articles were grouped into text, image, and biological sample categories. For text, the approaches were divided into heuristic, lexical, and pattern-based systems versus statistical learning-based systems. For images, approaches that de-identified photographic facial images and magnetic resonance image data were described. For biological samples, approaches that managed the identifiers linked with these samples were discussed, particularly with respect to meeting the anonymization requirements for Institutional Review Board exemption under the Common Rule.

Conclusions: Current de-identification strategies have limitations, and statistical learning-based systems have distinct advantages over other approaches for the de-identification of free text. True anonymization is challenging, and further work is needed on the de-identification of datasets and the protection of genetic information.
