Modeling User Intrinsic Characteristic on Social Media for Identity Linkage

Most users on social media have intrinsic characteristics, such as interests and political views, that can be exploited to identify and track them, raising privacy and identity concerns in online communities. In this article, we investigate the problem of user identity linkage on two behavior datasets collected from different experiments. Specifically, we focus on linking users based on their interaction behaviors with respect to content topics. We propose an embedding method that models each topic as a vector in a latent space to capture its deep semantics. A user is then modeled as a vector based on his or her interactions with topics. The topic embeddings are learned by optimizing a joint objective with three components: the compatibility between topics with similar semantics, the ability of topics to discriminate between identities, and the consistency of the same user's characteristics across the two datasets. We verify the effectiveness of our method on real-life datasets, and the results show that it outperforms related methods. We also analyze failure cases of our identity linkage method. This analysis shows that factors such as the visibility and variance of user behaviors and users' group psychology can result in mis-linkages. We further examine the detailed behaviors of some representative users to understand the underlying reasons their identities are mis-linked, and find that these users exhibit a high level of variance in their behaviors. Based on these experimental results, we introduce a confidence score into identity linkage to indicate how reliable each linkage result is.
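The pipeline described above (topic vectors, user vectors built from topic interactions, linkage across two datasets, and a confidence score) can be illustrated with a minimal sketch. This is not the authors' method: the topic vectors here are hand-set toy values rather than embeddings learned from the joint objective, the user vector is assumed to be an interaction-weighted average, the matching is a simple nearest-neighbor search by cosine similarity, and the confidence score is assumed to be the margin between the best and second-best match. All names (`user_vector`, `link`, the sample users and topics) are hypothetical.

```python
import math

def user_vector(interactions, topic_vecs):
    """Interaction-weighted average of topic vectors (an assumed user model).

    interactions maps topic name -> interaction count.
    """
    dim = len(next(iter(topic_vecs.values())))
    total = sum(interactions.values())
    vec = [0.0] * dim
    if total == 0:
        return vec  # user with no interactions maps to the zero vector
    for topic, count in interactions.items():
        for i, x in enumerate(topic_vecs[topic]):
            vec[i] += count * x / total
    return vec

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def link(users_a, users_b, topic_vecs):
    """For each user in dataset A, return (best match in B, confidence).

    Confidence is the similarity margin between the best and the
    second-best candidate (an assumption, not the paper's definition).
    """
    vecs_b = {u: user_vector(inter, topic_vecs) for u, inter in users_b.items()}
    result = {}
    for ua, inter in users_a.items():
        va = user_vector(inter, topic_vecs)
        scored = sorted(((cosine(va, vb), ub) for ub, vb in vecs_b.items()),
                        reverse=True)
        best_score, best_user = scored[0]
        runner_up = scored[1][0] if len(scored) > 1 else 0.0
        result[ua] = (best_user, best_score - runner_up)
    return result

# Toy topic vectors; in the paper these would be learned embeddings.
topic_vecs = {
    "politics": [1.0, 0.0],
    "elections": [0.9, 0.1],
    "sports": [0.0, 1.0],
}
# Two hypothetical behavior datasets with different user identifiers.
users_a = {"a1": {"politics": 5, "sports": 1}, "a2": {"sports": 4}}
users_b = {"b1": {"elections": 3}, "b2": {"sports": 6, "politics": 1}}

links = link(users_a, users_b, topic_vecs)
for user, (match, confidence) in links.items():
    print(user, "->", match, "confidence margin:", round(confidence, 3))
```

Because "politics" and "elections" sit close together in the toy latent space, user `a1` is linked to `b1` even though the two never interacted with the same topic, which is the intuition behind embedding topics rather than matching them by name. A small margin between the top two candidates would flag a linkage as unreliable, for example for users whose behavior varies widely.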
