Health Data in an Open World

With the aim of informing sound policy on data sharing and privacy, we describe the successful re-identification of patients in an Australian de-identified open health dataset. As in prior studies of similar datasets, a few mundane facts often suffice to isolate an individual. Some people can be identified by name using publicly available information. Decreasing the precision of the unit-record-level data, or perturbing it statistically, makes re-identification gradually harder, but at a substantial cost to utility. We also examine the value of related datasets in improving the accuracy and confidence of re-identification. Our re-identifications were performed on a 10% sample dataset, but a related open Australian dataset allows us to infer with high confidence that some individuals in the sample have been correctly re-identified. Finally, we examine the combination of the open datasets with certain commercial datasets that are known to exist but are not in our possession, and show that they would further increase the ease of re-identification.