Paper Information - Multi-domain Alias Matching Using Machine Learning

Multi-domain Alias Matching Using Machine Learning

We describe a methodology for linking aliases belonging to the same individual based on a user's writing style (stylometric features extracted from the user-generated content) and her time patterns (time-based features extracted from the publishing times of the user-generated content). While most previous research on social media identity linkage relies on matching usernames, our methodology can also be used for users who actively try to choose dissimilar usernames when creating their aliases. In our experiments on a discussion forum dataset and a Twitter dataset, we evaluate the performance of three different classifiers. We then use the best-performing classifier (AdaBoost) to evaluate how well the approach works on different datasets using different features. Experiments show that combining stylometric and time-based features yields good results on our synthetic datasets, and a small-scale evaluation on real-world blog data confirms these results, yielding a precision of over 95%. The use of emotion-related and Twitter-related features has no significant impact on the results.
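The idea of combining stylometric and time-based features for a pair of aliases can be illustrated with a minimal sketch. It is not the authors' implementation: the function-word list, the 24-bin hourly profile, the absolute-difference pairing, and all function names are assumptions chosen for illustration, since the abstract does not specify the exact feature set.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, tiny function-word list; the paper's actual features are not given here.
FUNCTION_WORDS = ("the", "and", "of", "to", "a", "in", "that", "is")


def stylometric_vector(posts):
    """Relative frequency of each function word, plus mean word length."""
    words = [w.lower().strip(".,!?;:\"'()") for p in posts for w in p.split()]
    words = [w for w in words if w]
    total = len(words) or 1  # avoid division by zero for empty input
    counts = Counter(words)
    freqs = [counts[w] / total for w in FUNCTION_WORDS]
    mean_len = sum(len(w) for w in words) / total
    return freqs + [mean_len]


def time_profile(timestamps):
    """Normalised 24-bin histogram of posting hours."""
    counts = Counter(t.hour for t in timestamps)
    total = len(timestamps) or 1
    return [counts[h] / total for h in range(24)]


def pair_features(alias_a, alias_b):
    """Element-wise absolute difference between two aliases' feature vectors."""
    va = stylometric_vector(alias_a["posts"]) + time_profile(alias_a["times"])
    vb = stylometric_vector(alias_b["posts"]) + time_profile(alias_b["times"])
    return [abs(x - y) for x, y in zip(va, vb)]


if __name__ == "__main__":
    a = {"posts": ["The cat is in the hat."], "times": [datetime(2015, 1, 1, 9, 30)]}
    b = {"posts": ["A dog and a bone."], "times": [datetime(2015, 1, 2, 22, 5)]}
    print(pair_features(a, b))
```

Under these assumptions, each pair of aliases becomes one fixed-length vector, and pairs labeled "same user" or "different user" could then be used to train a boosted classifier (for example, scikit-learn's `AdaBoostClassifier`).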
