Linking Users Across Domains with Location Data: Theory and Validation

Linking accounts of the same user across datasets -- even when personally identifying information is removed or unavailable -- is an important open problem studied in many contexts. Beyond many practical applications, (such as cross domain analysis, recommendation, and link prediction), understanding this problem more generally informs us on the privacy implications of data disclosure. Previous work has typically addressed this question using either different portions of the same dataset or observing the same behavior across thematically similar domains. In contrast, the general cross-domain case where users have different profiles independently generated from a common but unknown pattern raises new challenges, including difficulties in validation, and remains under-explored. In this paper, we address the reconciliation problem for location-based datasets and introduce a robust method for this general setting. Location datasets are a particularly fruitful domain to study: such records are frequently produced by users in an increasing number of applications and are highly sensitive, especially when linked to other datasets. Our main contribution is a generic and self-tunable algorithm that leverages any pair of sporadic location-based datasets to determine the most likely matching between the users it contains. While making very general assumptions on the patterns of mobile users, we show that the maximum weight matching we compute is provably correct. Although true cross-domain datasets are a rarity, our experimental evaluation uses two entirely new data collections, including one we crawled, on an unprecedented scale. The method we design outperforms naive rules and prior heuristics. As it combines both sparse and dense properties of location-based data and accounts for probabilistic dynamics of observation, it can be shown to be robust even when data gets sparse.

[1]  Peter Christen,et al.  Data Matching , 2012, Data-Centric Systems and Applications.

[2]  Krishna P. Gummadi,et al.  On the Reliability of Profile Matching Across Large Online Social Networks , 2015, KDD.

[3]  Ying Wang,et al.  Algorithms for Large, Sparse Network Alignment Problems , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[4]  Diane J. Cook,et al.  Ambient and smartphone sensor assisted ADL recognition in multi-inhabitant smart environments , 2016, J. Ambient Intell. Humaniz. Comput..

[5]  Steven M. Bellovin,et al.  – Revealing the Demographics of Location Data , 2015 .

[6]  Franco Zambonelli,et al.  Re-identification and information fusion between anonymized CDR and social network data , 2015, Journal of Ambient Intelligence and Humanized Computing.

[7]  Vitaly Shmatikov,et al.  Robust De-anonymization of Large Sparse Datasets , 2008, 2008 IEEE Symposium on Security and Privacy (sp 2008).

[8]  Michael Hicks,et al.  Deanonymizing mobility traces: using social network as a side-channel , 2012, CCS.

[9]  Jure Leskovec,et al.  Friendship and mobility: user movement in location-based social networks , 2011, KDD.

[10]  Hui Zang,et al.  Anonymization of location data does not work: a large-scale measurement study , 2011, MobiCom.

[11]  Mirco Musolesi,et al.  It's the way you check-in: identifying users in location-based social networks , 2014, COSN '14.

[12]  Sree Hari Krishnan Parthasarathi,et al.  Exploiting innocuous activity for correlating users across sites , 2013, WWW.

[13]  Silvio Lattanzi,et al.  An efficient reconciliation algorithm for social networks , 2013, Proc. VLDB Endow..

[14]  Dan Cosley,et al.  Inferring social ties from geographic coincidences , 2010, Proceedings of the National Academy of Sciences.

[15]  Jayakrishnan Unnikrishnan,et al.  Asymptotically Optimal Matching of Multiple Sequences to Source Distributions and Training Sequences , 2014, IEEE Transactions on Information Theory.

[16]  Matthias Grossglauser,et al.  On the privacy of anonymized networks , 2011, KDD.

[17]  Shouling Ji,et al.  Structure Based Data De-Anonymization of Social Networks and Mobility Traces , 2014, ISC.

[18]  Philip S. Yu,et al.  Transferring heterogeneous links across location-based social networks , 2014, WSDM.

[19]  Matthias Grossglauser,et al.  Growing a Graph Matching from a Handful of Seeds , 2015, Proc. VLDB Endow..

[20]  César A. Hidalgo,et al.  Unique in the Crowd: The privacy bounds of human mobility , 2013, Scientific Reports.

[21]  Matthias Grossglauser,et al.  On the performance of percolation graph matching , 2013, COSN '13.

[22]  Jayakrishnan Unnikrishnan,et al.  De-anonymizing private data by matching statistics , 2013, 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[23]  Y. de Montjoye,et al.  Unique in the shopping mall: On the reidentifiability of credit card metadata , 2015, Science.

[24]  Marco Mamei,et al.  Re-identification of anonymized CDR datasets using social network data , 2014, 2014 IEEE International Conference on Pervasive Computing and Communication Workshops (PERCOM WORKSHOPS).

[25]  Nicholas Jing Yuan,et al.  You Are Where You Go: Inferring Demographic Attributes from Location Check-ins , 2015, WSDM.

[26]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[27]  Danai Koutra,et al.  BIG-ALIGN: Fast Bipartite Graph Alignment , 2013, 2013 IEEE 13th International Conference on Data Mining.

[28]  Steven M. Bellovin,et al.  "I don't have a photograph, but you can have my footprints.": Revealing the Demographics of Location Data , 2015, ICWSM.

[29]  Vitaly Shmatikov,et al.  De-anonymizing Social Networks , 2009, 2009 30th IEEE Symposium on Security and Privacy.