Link prediction by de-anonymization: How We Won the Kaggle Social Network Challenge

This paper describes the winning entry to the IJCNN 2011 Social Network Challenge run by Kaggle.com. The goal of the contest was to promote research on real-world link prediction, and the dataset was a graph obtained by crawling the popular Flickr social photo sharing website, with user identities scrubbed. By de-anonymizing much of the competition test set using our own Flickr crawl, we were able to effectively game the competition. Our attack represents a new application of de-anonymization to gaming machine learning contests, suggesting changes in how future competitions should be run. We introduce a new simulated annealing-based weighted graph matching algorithm for the seeding step of de-anonymization. We also show how to combine de-anonymization with link prediction—the latter is required to achieve good performance on the portion of the test set not de-anonymized—for example by training the predictor on the de-anonymized portion of the test set, and combining probabilistic predictions from de-anonymization and link prediction.

[1]  Frank M. Shipman,et al.  Link prediction applied to an open large-scale online social network , 2010, HT '10.

[2]  Charu C. Aggarwal,et al.  On k-Anonymity and the Curse of Dimensionality , 2005, VLDB.

[3]  Jon M. Kleinberg,et al.  The link-prediction problem for social networks , 2007, J. Assoc. Inf. Sci. Technol..

[4]  Vitaly Shmatikov,et al.  Robust De-anonymization of Large Sparse Datasets , 2008, 2008 IEEE Symposium on Security and Privacy (sp 2008).

[5]  Endika Bengoetxea,et al.  Inexact Graph Matching Using Estimation of Distribution Algorithms , 2002 .

[6]  Tsuyoshi Murata,et al.  Link Prediction based on Structural Properties of Online Social Networks , 2008, New Generation Computing.

[7]  Hsinchun Chen,et al.  Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace , 2008, TOIS.

[8]  Philippe Golle,et al.  On the Anonymity of Home/Work Location Pairs , 2009, Pervasive.

[9]  Ilya Mironov,et al.  Differentially private recommender systems: building privacy into the net , 2009, KDD.

[10]  Vitaly Shmatikov,et al.  De-anonymizing Social Networks , 2009, 2009 30th IEEE Symposium on Security and Privacy.

[11]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[12]  Shinji Umeyama,et al.  An Eigendecomposition Approach to Weighted Graph Matching Problems , 1988, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Donald F. Towsley,et al.  Resisting structural re-identification in anonymized social networks , 2010, The VLDB Journal.

[14]  Bruce Hajek,et al.  A tutorial survey of theory and applications of simulated annealing , 1985, 1985 24th IEEE Conference on Decision and Control.

[15]  P. Strevens Iii , 1985 .

[16]  Cynthia Dwork,et al.  Differential Privacy: A Survey of Results , 2008, TAMC.

[17]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[18]  John Riedl,et al.  You are what you say: privacy risks of public mentions , 2006, SIGIR '06.

[19]  Alina Campan,et al.  A Clustering Approach for Data and Structural Anonymity in Social Networks , 2008 .

[20]  Jian Pei,et al.  Preserving Privacy in Social Networks Against Neighborhood Attacks , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[21]  S. Nelson,et al.  Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays , 2008, PLoS genetics.

[22]  Cynthia Dwork,et al.  Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography , 2007, WWW '07.

[23]  Salih O. Duffuaa,et al.  A Linear Programming Approach for the Weighted Graph Matching Problem , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[24]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[25]  K. Liu,et al.  Towards identity anonymization on graphs , 2008, SIGMOD Conference.

[26]  Bradley Malin,et al.  How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems , 2004, J. Biomed. Informatics.