Learning to de-anonymize social networks

Releasing anonymized social network data for analysis has been a popular idea among data providers. Despite evidence to the contrary, the belief that anonymization will solve the privacy problem in practice refuses to die. This dissertation contributes to the field of social graph de-anonymization by demonstrating that even automated models can be quite successful in breaching the privacy of such datasets. We propose novel machine-learning-based techniques to learn the identities of nodes in social graphs, thereby automating manual, heuristic-based attacks. Our work extends the vast literature on social graph de-anonymization attacks by systematizing them. We present a random-forest-based classifier which uses structural node features, derived from the neighborhood degree distribution, to predict the similarity of nodes. Using these simple and efficient features, we design versatile and expressive learning models which can learn the de-anonymization task from just a few examples. Our evaluation establishes their efficacy in transforming de-anonymization into a learning problem. The learning is transferable, in that a model trained on one graph can be used to attack another. We then demonstrate the versatility and wider applicability of the proposed model by using it to solve the long-standing problem of benchmarking social graph anonymization schemes. Our framework bridges a fundamental research gap by making cheap, quick and automated analysis of anonymization schemes possible, without even requiring their full description. The benchmark compares structural information leakage against utility preservation. We study the trade-off between anonymity and utility for six popular anonymization schemes, including those promising k-anonymity. Our analysis shows that none of the schemes is fit for purpose.
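To make the idea of a structural node feature based on the neighborhood degree distribution concrete, the following is a minimal sketch rather than the dissertation's implementation. It assumes an adjacency-dictionary graph, uses only 1-hop neighbors, and picks an arbitrary bin count and bin size; the function names `degree_histogram` and `pair_features` are hypothetical. A classifier such as a random forest could then be trained on the resulting pair vectors, labeled as matching or non-matching.

```python
def degree_histogram(adj, node, n_bins=4, bin_size=2):
    """Bin the degrees of a node's 1-hop neighbours into a fixed-length vector.

    n_bins and bin_size are illustrative assumptions, not values from the source.
    """
    hist = [0] * n_bins
    for nbr in adj[node]:
        deg = len(adj[nbr])  # a neighbour always has degree >= 1
        # Degrees beyond the last bin are clamped into it.
        idx = min((deg - 1) // bin_size, n_bins - 1)
        hist[idx] += 1
    return hist


def pair_features(adj_a, u, adj_b, v, n_bins=4, bin_size=2):
    """Concatenate the histograms of node u (graph A) and node v (graph B).

    A learning model would receive this vector and predict whether
    u and v correspond to the same individual.
    """
    return (degree_histogram(adj_a, u, n_bins, bin_size)
            + degree_histogram(adj_b, v, n_bins, bin_size))


# Toy graph: node 0 is connected to 1, 2 and 3; nodes 1 and 2 are also linked.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

if __name__ == "__main__":
    print(degree_histogram(graph, 0))
    print(pair_features(graph, 1, graph, 2))
```

Because the feature depends only on local degree information, it is cheap to compute and does not rely on node identifiers, which is what allows a model trained on pairs from one graph to be applied to another.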
Finally, we present an end-to-end social graph de-anonymization attack which uses the proposed machine learning techniques to recover node mappings across intersecting graphs. Our attack advances the state of the art in graph de-anonymization by demonstrating better performance than all the other attacks, including those that use seed knowledge. The attack is seedless and heuristic-free, which demonstrates the superiority of machine learning techniques over hand-selected parametric attacks.

Acknowledgments

First and foremost, I would like to thank my supervisor Ross Anderson, without whom this dissertation would not have been possible. He helped me at critical junctures and provided encouragement and valuable feedback; I owe him much gratitude for whatever I managed to achieve at Cambridge. I thank George Danezis for mentoring me during the initial stages of my PhD and teaching me how to …
