Learning Invariant Representations of Social Media Users

The evolution of social media users’ behavior over time complicates user-level comparison tasks such as verification, classification, clustering, and ranking. As a result, naive approaches may fail to generalize to new users or even to future observations of previously known users. In this paper, we propose a novel procedure to learn a mapping from short episodes of user activity on social media to a vector space in which the distance between points captures the similarity of the corresponding users’ invariant features. We fit the model by optimizing a surrogate metric learning objective over a large corpus of unlabeled social media content. Once learned, the mapping may be applied to users not seen at training time and enables efficient comparisons of users in the resulting vector space. We present a comprehensive evaluation to validate the benefits of the proposed approach using data from Reddit, Twitter, and Wikipedia.
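To make the comparison step concrete, here is a minimal sketch of ranking users by distance once their activity episodes have been mapped to vectors. It is an illustration only, not the paper's implementation: the three-dimensional embeddings, the user names, the helper functions `cosine_distance` and `rank_candidates`, and the choice of cosine distance are all assumptions. The abstract says only that distance in the learned space reflects similarity.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return candidate names ordered from nearest to farthest from the query."""
    return sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))


# Hypothetical embeddings standing in for the output of a learned mapping.
query_episode = [0.9, 0.1, 0.0]
candidate_episodes = {
    "user_a": [0.8, 0.2, 0.1],
    "user_b": [0.0, 1.0, 0.0],
    "user_c": [-0.5, 0.1, 0.9],
}

ranking = rank_candidates(query_episode, candidate_episodes)
print(ranking)
```

With these placeholder vectors, `user_a` ranks first because its embedding points in nearly the same direction as the query. For large user collections, the exhaustive sort would typically be replaced by an approximate nearest-neighbor index.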
