Provable De-anonymization of Large Datasets with Sparse Dimensions

There is a significant body of empirical work on statistical de-anonymization attacks against databases containing micro-data about individuals, e.g., their preferences, movie ratings, or transaction data. Our goal is to analytically explain why such attacks work. Specifically, we analyze a variant of the Narayanan-Shmatikov algorithm that was used to de-anonymize the Netflix database of movie ratings. We prove theorems characterizing mathematical properties of the database, and of the auxiliary information available to the adversary, that enable two classes of privacy attacks. In the first attack, the adversary successfully identifies the individual about whom she possesses auxiliary information (an isolation attack). In the second, the adversary learns additional information about the individual, although she may not be able to uniquely identify him (an information amplification attack). We demonstrate the applicability of the analytical results by empirically verifying that the mathematical properties assumed of the database indeed hold for a significant fraction of the records in the Netflix movie ratings database, which contains ratings from about 500,000 users.
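
To make the two attack classes concrete, the following is a minimal Python sketch of a Narayanan-Shmatikov-style weighted scoring attack. It is an illustration under stated assumptions, not the paper's exact algorithm: the 1/log(support) attribute weighting, the eccentricity threshold phi, the threshold similarity function, and all names (attribute_weights, isolate, amplify) are hypothetical choices for exposition.

```python
# A minimal sketch of a Narayanan-Shmatikov-style scoring attack,
# illustrating the isolation and information amplification attacks.
# The 1/log weighting, the eccentricity test, and all names below
# are illustrative assumptions, not the paper's exact algorithm.
import math
from collections import defaultdict

def attribute_weights(db):
    # Rare attributes are more identifying: wt(attr) ~ 1/log(support).
    # (1 + count) avoids division by zero for attributes seen only once.
    support = defaultdict(int)
    for rec in db.values():
        for attr in rec:
            support[attr] += 1
    return {a: 1.0 / math.log(1 + c) for a, c in support.items()}

def score(aux, record, weights, sim):
    # Weighted similarity between the adversary's auxiliary information
    # and one candidate record (dicts mapping attribute -> value).
    return sum(weights[a] * sim(v, record[a])
               for a, v in aux.items() if a in record)

def isolate(aux, db, sim, phi=1.5):
    # Isolation attack: return the best-matching record id only if it
    # stands out from the runner-up by phi standard deviations of the
    # score distribution; otherwise report no confident match.
    weights = attribute_weights(db)
    scores = {rid: score(aux, rec, weights, sim) for rid, rec in db.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    mean = sum(scores.values()) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores.values()) / len(scores))
    if std > 0 and (scores[ranked[0]] - scores[ranked[1]]) / std >= phi:
        return ranked[0]
    return None

def amplify(aux, db, sim, phi=1.5):
    # Information amplification: read attributes the adversary did not
    # already know off the isolated record (empty if isolation fails).
    rid = isolate(aux, db, sim, phi)
    return {} if rid is None else \
        {a: v for a, v in db[rid].items() if a not in aux}

if __name__ == "__main__":
    # Toy database of movie ratings; aux holds two known ratings.
    db = {"u1": {"A": 5, "B": 1, "C": 4},
          "u2": {"A": 2, "D": 3},
          "u3": {"B": 5, "D": 1}}
    sim = lambda x, y: 1.0 if abs(x - y) <= 1 else 0.0  # threshold match
    print(isolate({"A": 5, "B": 2}, db, sim))   # -> "u1"
    print(amplify({"A": 5, "B": 2}, db, sim))   # -> {"C": 4}
```

The eccentricity test in isolate is what distinguishes the two outcomes here: when no record stands out, isolation fails, but a sufficiently dominant match both identifies the record and lets amplify reveal attributes the adversary did not previously know, which is the intuition behind why sparse, high-dimensional databases are vulnerable.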
