UGCLink: User Identity Linkage by Modeling User Generated Contents with Knowledge Distillation

User identity linkage aims to link accounts that belong to the same user across different social networks. Recently, researchers have modeled the similarities of users’ behaviors, such as Points of Interest (PoIs) or User Generated Contents (UGCs), to predict the identities of users. However, the problem is non-trivial due to the following challenges: 1) PoIs are usually sparse on non-location-based social platforms, so it is impractical to measure the similarities of users solely with PoIs; 2) The similarities of UGCs are hierarchical from the view of word, phrase, and sentence, and how to model this hierarchical structure remains a key challenge; 3) The semantics of words are unreliable: two different words may refer to the same physical appearance of users, indicating that the users have the same identity. To tackle the above problems, we propose UGCLink, a knowledge distillation framework that models UGCs to predict user identities. The framework has two main components: a student network that models the similarities of UGCs, and a teacher network that guides the student network to learn better word embeddings that reveal the physical appearance of users. The teacher network is a document classification model that classifies UGCs into PoI categories; it is trained to guide the word embedding learning process in the student network and thereby circumvent the unreliable semantics problem. We demonstrate that our proposed method outperforms the state-of-the-art methods by more than 11% in terms of AUC score.
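
The teacher-student setup described above can be illustrated with a minimal sketch of a generic knowledge distillation objective. This is not the UGCLink loss, which the abstract does not specify. The sketch assumes that the teacher and the student both emit logits over PoI categories, that the student also produces one embedding per user, and that the two terms are mixed by a weight. All names here (`distillation_loss`, `linkage_loss`, `temperature`, `alpha`) are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def linkage_loss(u, v, same_user, eps=1e-12):
    """Binary cross-entropy on cosine similarity rescaled to (0, 1).

    Assumption: user embeddings u and v come from the student network.
    """
    prob = (cosine_similarity(u, v) + 1.0) / 2.0
    prob = min(max(prob, eps), 1.0 - eps)
    return -math.log(prob) if same_user else -math.log(1.0 - prob)


def distillation_loss(teacher_logits, student_logits, u, v, same_user,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a distillation term and a linkage term.

    Assumption: both networks output logits over the same PoI categories.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    distill = kl_divergence(teacher_probs, student_probs)
    task = linkage_loss(u, v, same_user)
    return alpha * distill + (1.0 - alpha) * task


if __name__ == "__main__":
    # Toy inputs: three PoI categories and two-dimensional user embeddings.
    loss = distillation_loss([2.0, 0.5, 0.1], [1.5, 0.7, 0.2],
                             [0.9, 0.1], [0.8, 0.2], same_user=True)
    print(f"combined loss: {loss:.4f}")
```

In this sketch the distillation term pulls the student's category distribution toward the teacher's, while the linkage term rewards high similarity for pairs of accounts that belong to the same user. How UGCLink actually couples the teacher to the student's word embeddings may differ.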