A Unified Framework for User Identification across Online and Offline Data

User identification across multiple datasets has a wide range of applications, and a growing body of research has addressed it in recent years. However, most existing work focuses on user identification with a single type of input data, e.g., (I) identifying a user across multiple social networks using online data, or (II) detecting a single user across heterogeneous trajectory datasets using offline data. In contrast, this paper addresses user identification between online and offline datasets. We connect these two types of data through a mapping from IP addresses to physical locations. To solve this problem, we propose a novel framework consisting of three steps. First, we use a clustering method based on the locations of IP addresses to map IP addresses to distributions over specific physical locations. Second, we propose a novel pairwise index that reduces the space cost and running time of computing co-occurrence. Lastly, we apply a learning-to-rank method to combine the multiple features obtained in the first two steps. We design experiments to demonstrate the time and space efficiency of our framework, and to compare the precision and recall of our approach with those of other methods.
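
To make the notion of co-occurrence concrete, the following is a minimal sketch and not the pairwise index proposed in the paper, whose design the abstract does not describe. It assumes that each online event (after IP-to-location mapping) and each offline event has already been reduced to a hypothetical (user, cell, time bucket) triple, and it uses a plain inverted index so that only users sharing a key are ever paired. The function name, the cell labels, and the sample data are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product


def cooccurrence_counts(online_events, offline_events):
    """Count shared (cell, time_bucket) keys per (online_user, offline_user) pair.

    Each event is a hypothetical (user, cell, time_bucket) triple.
    """
    # Inverted index for each side: (cell, bucket) -> set of users seen there.
    online_index = defaultdict(set)
    for user, cell, bucket in online_events:
        online_index[(cell, bucket)].add(user)

    offline_index = defaultdict(set)
    for user, cell, bucket in offline_events:
        offline_index[(cell, bucket)].add(user)

    counts = defaultdict(int)
    # Only keys present on both sides can produce a co-occurrence,
    # so pairs that never share a key are never enumerated.
    for key in online_index.keys() & offline_index.keys():
        for pair in product(online_index[key], offline_index[key]):
            counts[pair] += 1
    return dict(counts)


# Illustrative data only: two online accounts and two offline devices.
online = [("acct_a", "cell_1", 9), ("acct_a", "cell_2", 18), ("acct_b", "cell_3", 9)]
offline = [
    ("dev_x", "cell_1", 9),
    ("dev_x", "cell_2", 18),
    ("dev_y", "cell_3", 9),
    ("dev_y", "cell_1", 10),
]
counts = cooccurrence_counts(online, offline)
```

In a pipeline like the one outlined above, counts of this kind would be one candidate feature passed to the learning-to-rank step; that use is an assumption about how the pieces could fit together, not a description of the authors' implementation.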
