NRF: A Naive Re-identification Framework

The promise of big data relies on the release and aggregation of data sets. When these data sets contain sensitive information about individuals, de-identification has been a scalable and convenient way to protect their privacy. However, studies show that combining de-identified data sets with other data sets risks re-identification of some records. Some studies have shown how to measure this risk in specific contexts where certain types of public data sets (such as voter rolls) are assumed to be available to attackers. To the extent that it can be accomplished, such analysis enables the threat of compromise to be balanced against the benefits of sharing data. For example, sharing de-identified data that might save lives by enabling medical research may be justified by a sufficiently low probability of compromise. In this paper, we introduce a general probabilistic re-identification framework that can be instantiated in specific contexts to estimate the probability of compromise based on explicit assumptions. We further propose a baseline set of such assumptions that enables a first-cut estimate of risk for practical case studies. We refer to the framework with these assumptions as the Naive Re-identification Framework (NRF). As a case study, we show how NRF can be applied to analyze and quantify the risk of re-identification arising from the release of de-identified medical data in the context of publicly available social media data. The results of this case study show that NRF can be used to obtain a meaningful quantification of re-identification risk, to compare the risk across different social media platforms, and to assess the risk of combinations of demographic attributes and medical conditions that individuals may voluntarily disclose on social media.
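To make the "naive" baseline concrete, the following is a minimal sketch of an independence-assumption risk estimate, not the paper's actual method: it treats each quasi-identifier attribute as statistically independent, multiplies their marginal population frequencies to get a joint frequency, and derives a matching-set size from it. All attribute frequencies and the population size below are hypothetical, chosen only for illustration.

```python
# Hedged sketch: naive (independence-assumption) re-identification risk.
# Attribute frequencies and population size are illustrative placeholders,
# not values from the paper.

def naive_reid_risk(attribute_freqs, population_size):
    """Estimate re-identification risk for a record whose quasi-identifier
    attributes occur with the given marginal frequencies, assuming the
    attributes are statistically independent (the 'naive' assumption)."""
    joint = 1.0
    for f in attribute_freqs:
        joint *= f  # naive independence: joint frequency = product of marginals
    expected_matches = joint * population_size
    if expected_matches <= 0:
        return 1.0  # degenerate case: no expected match set, treat as unique
    # If fewer than one person is expected to match, the record is effectively
    # unique; otherwise the attacker picks uniformly among the candidates.
    return min(1.0, 1.0 / expected_matches)

# Hypothetical example: age bracket (10%), gender (50%), region (1%), and a
# voluntarily disclosed medical condition (2%) in a population of 1,000,000.
risk = naive_reid_risk([0.10, 0.50, 0.01, 0.02], 1_000_000)
# Expected match set: 0.10 * 0.50 * 0.01 * 0.02 * 1e6 = 10 people,
# so the naive risk estimate is 1/10 = 0.1.
```

Under this sketch, adding a rarer attribute (e.g. a disclosed condition) shrinks the expected match set and pushes the risk toward 1, which mirrors the abstract's point that combinations of demographic attributes and medical conditions disclosed on social media drive re-identification risk.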
