An Open Source Tool for Game Theoretic Health Data De-Identification

Biomedical data continues to grow in quantity and quality, creating new opportunities for research and data-driven applications. To realize these activities at scale, data must be shared beyond its initial point of collection. To maintain privacy, healthcare organizations often de-identify data, but they typically assume worst-case adversaries, which leads to heavy distortion of the data. Recently, game theory has been proposed to account for the incentives of both data publishers and data recipients (who may attempt to re-identify patients), but this perspective has been more hypothetical than practical. In this paper, we report on a new game theoretic data publication strategy and its integration into the open source software ARX. We evaluate our implementation by analyzing the relationship between data transformation, utility, and efficiency for over 30,000 demographic records drawn from the U.S. Census Bureau. The results indicate that our implementation is scalable and can be combined with various data privacy risk and quality measures.
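
To make the incentive-based idea concrete, the following is a minimal sketch of a publisher choosing a generalization level while anticipating a rational adversary. It is not the strategy implemented in ARX, and it uses no ARX API. Every name and number in it (`group_sizes`, `publisher_payoff`, `best_level`, the toy ages, and the benefit, loss, gain, and cost values) is a hypothetical assumption chosen for illustration. The sketch assumes that an adversary attacks a record only when the expected gain, taken as the gain times a success probability of one over the size of the record's group, exceeds the cost of the attack.

```python
from collections import Counter

def group_sizes(records, level, generalize):
    """Size of the group of indistinguishable records each record falls into."""
    keys = [generalize(r, level) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]

def publisher_payoff(sizes, benefit, loss, adv_gain, adv_cost):
    """Sum of per-record payoffs: sharing benefit minus expected attack loss."""
    total = 0.0
    for size in sizes:
        p_success = 1.0 / size
        # Assumed adversary model: attack only if it pays off in expectation.
        attacked = p_success * adv_gain > adv_cost
        total += benefit - (loss * p_success if attacked else 0.0)
    return total

def best_level(records, levels, generalize, benefit_at, loss, adv_gain, adv_cost):
    """Pick the generalization level with the highest publisher payoff."""
    def payoff(level):
        sizes = group_sizes(records, level, generalize)
        return publisher_payoff(sizes, benefit_at[level], loss, adv_gain, adv_cost)
    return max(levels, key=payoff)

def generalize_age(age, level):
    """Toy hierarchy: 0 = exact age, 1 = decade, 2 = fully suppressed."""
    if level == 0:
        return age
    if level == 1:
        return age // 10
    return "*"

# Hypothetical toy data and parameters (not taken from the paper).
ages = [21, 22, 23, 24, 35, 36, 37, 38]
benefit_at = {0: 10.0, 1: 7.0, 2: 2.0}  # utility per record drops with generalization
chosen = best_level(ages, [0, 1, 2], generalize_age, benefit_at,
                    loss=30.0, adv_gain=30.0, adv_cost=10.0)
print("chosen level:", chosen)
```

With these toy values, publishing exact ages invites an attack on every record, full suppression deters all attacks but yields little benefit, and the decade level is the cheapest transformation that makes an attack unprofitable. A worst-case model would ignore the adversary's cost and tend toward stronger generalization.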
