Where You Are Is Who You Are: User Identification by Matching Statistics

Most users of online services have unique behavioral or usage patterns. These patterns can be exploited to identify and track users from their observed behavior alone. We study the task of identifying users from statistics of their behavioral patterns. In particular, we focus on the setting in which we are given histograms of users' data collected during two different experiments. We assume that the users' identities are anonymized or hidden in the first data set and known in the second. The task is to identify the users by matching the histograms of their data in the first data set with the histograms from the second data set. Recent work introduced an optimal algorithm for this user identification task. In this paper, we evaluate the effectiveness of this method on three different types of data sets with up to 50 000 users, and in multiple scenarios. Using call data records, web browsing histories, and GPS trajectories, we demonstrate that a large fraction of users can be easily identified given only histograms of their data; hence, these histograms can act as users' fingerprints. We also verify that identifying users simultaneously achieves better performance than identifying them one by one. Furthermore, we show that the optimal method indeed gives higher identification accuracy than heuristics-based approaches in practical scenarios. The accuracy obtained under the optimal method can thus be used to quantify the maximum level of user identification that is possible in such settings. We show that the key factors affecting the accuracy of the optimal identification algorithm are the duration of the data collection, the number of users in the anonymized data set, and the resolution of the data set. We also analyze the effectiveness of k-anonymization in resisting user identification attacks on these data sets.
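The difference between one-by-one and simultaneous identification can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the matching cost, so the sketch assumes the Kullback-Leibler divergence between smoothed histograms, and it finds the minimum-cost one-to-one assignment by brute force over permutations, which is only feasible for a handful of users. All function names, the smoothing constant `eps`, and the toy histograms are hypothetical.

```python
from itertools import permutations
from math import log


def normalize(counts, eps=1e-9):
    """Turn raw counts into a probability vector; eps avoids log(0) (assumed smoothing)."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL divergence D(p || q); used here as an assumed dissimilarity between histograms."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))


def cost_matrix(anonymous, known):
    """cost[i][j] = divergence between anonymous histogram i and known histogram j."""
    anon = [normalize(h) for h in anonymous]
    ref = [normalize(h) for h in known]
    return [[kl_divergence(a, r) for r in ref] for a in anon]


def match_one_by_one(cost):
    """Assign each anonymous histogram independently to its closest known user."""
    return [min(range(len(row)), key=row.__getitem__) for row in cost]


def match_simultaneously(cost):
    """Pick the one-to-one assignment with minimum total cost (brute force, tiny n only)."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Toy data (hypothetical): counts of visits to three locations per user.
known = [[8, 1, 1], [5, 4, 1], [1, 1, 8]]        # identified data set, users 0, 1, 2
anonymous = [[6, 3, 1], [5, 5, 0], [0, 2, 8]]    # anonymized data set, same user order

cost = cost_matrix(anonymous, known)
greedy = match_one_by_one(cost)
joint = match_simultaneously(cost)
print("one-by-one:  ", greedy)
print("simultaneous:", joint)
```

On this toy data, one-by-one matching assigns the first two anonymous histograms to the same known user, whereas the simultaneous assignment is forced to be one-to-one and recovers all three identities. For realistic numbers of users, the brute-force search would have to be replaced by a polynomial-time minimum-cost bipartite matching algorithm.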
