Anonymization of administrative billing codes with repeated diagnoses through censoring.

Patient-specific data from electronic medical records (EMRs) is increasingly shared in a de-identified form to support research. However, EMRs are susceptible to noise, error, and variation, which can limit their utility for reuse. One way to enhance the utility of shared EMR data is to record the number of times each diagnosis code is assigned to a patient. This is challenging, however, because the released counts may be leveraged to compromise patients' identities. In this paper, we present an approach that, to the best of our knowledge, is the first that can prevent re-identification through repeated diagnosis codes. Our method transforms records to preserve privacy while retaining much of their utility. Experiments on the records of 2676 patients from the EMR system of the Vanderbilt University Medical Center show that our method retains an average of 95.4% of the diagnosis codes in a common data sharing scenario.
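The abstract does not spell out the transformation itself, so the following is only a minimal sketch of what censoring repeated diagnosis codes could look like. It assumes each patient record is a multiset of codes, that censoring means capping each code's repeat count at a single global threshold, and that privacy means every released record is shared by at least k patients. The function names, the global cap, and the toy codes are all hypothetical and are not taken from the paper.

```python
from collections import Counter

def censor(record, cap):
    """Cap each diagnosis code's repeat count at `cap` (assumed censoring rule)."""
    return Counter({code: min(count, cap) for code, count in record.items()})

def smallest_group(records):
    """Return the size of the smallest group of patients with identical records."""
    groups = Counter(frozenset(r.items()) for r in records)
    return min(groups.values())

def censor_until_k(records, k):
    """Lower a global cap until every censored record is shared by >= k patients.

    Assumes `records` is non-empty and every record holds at least one code.
    Returns (cap, censored_records), or (None, None) if no cap suffices.
    """
    cap = max(max(r.values()) for r in records)
    while cap >= 1:
        censored = [censor(r, cap) for r in records]
        if smallest_group(censored) >= k:
            return cap, censored
        cap -= 1
    return None, None

def retained_fraction(original, censored):
    """Fraction of code occurrences that survive censoring."""
    kept = sum(sum(r.values()) for r in censored)
    total = sum(sum(r.values()) for r in original)
    return kept / total

# Toy data: code -> number of times it was assigned to the patient.
records = [
    Counter({"250.00": 3, "401.1": 1}),
    Counter({"250.00": 2, "401.1": 1}),
    Counter({"250.00": 5}),
    Counter({"250.00": 2}),
]
cap, censored = censor_until_k(records, k=2)
fraction = retained_fraction(records, censored)
```

On this toy input the loop stops at a cap of 2, at which point the four records collapse into two groups of two patients each and 10 of the 14 original code occurrences are retained. A single global cap is the simplest possible choice; a per-code or per-group threshold would likely retain more information at the cost of a more involved search.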
