Word Embeddings for User Profiling in Online Social Networks

User profiling in social networks can be significantly augmented by using available full-text items, such as posts or statuses, and the ratings (in the form of likes) that users give them. In this work, we apply modern natural language processing techniques based on word embeddings to several problems related to user profiling in social networks. First, we present an approach to create user profiles that measure a user's interest in various topics mined from the full texts of the items. As a result, we get a user profile that can be used, e.g., for cold start recommendations for items, targeted advertisement, and other purposes; our experiments show that the interests mining method performs on a level comparable with collaborative algorithms while at the same time being a cold start approach, i.e., it does not use the likes of an item being recommended. Second, we study the problem of predicting a user's demographic attributes, such as age and gender, based on his or her full-text items. We evaluate the efficiency of various age prediction algorithms based on word2vec word embeddings and conduct an extensive experimental evaluation, comparing these algorithms with each other and with classical baseline approaches.
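
As a rough illustration of the first idea, an embedding-based interest profile could be sketched as follows. This is not the method evaluated in this work, only a minimal sketch under stated assumptions: the toy 3-dimensional vectors in `EMBEDDINGS`, the topic centroids in `TOPICS`, and the helper names are all hypothetical, and in practice the vectors would come from a trained word2vec model and the topics from clustering or topic modeling. Each liked item is represented by the mean of its word vectors, the user by the mean of the liked items, and the interest in a topic by cosine similarity to that topic's centroid.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values); real vectors would
# come from a trained word2vec model.
EMBEDDINGS = {
    "goal": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "recipe": [0.1, 0.9, 0.0],
    "oven": [0.0, 0.8, 0.2],
}

# Hypothetical topic centroids, e.g. obtained by clustering word vectors.
TOPICS = {
    "sports": [1.0, 0.0, 0.0],
    "cooking": [0.0, 1.0, 0.0],
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embed_text(text):
    """Mean embedding of the known words in a text, or None if none are known."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    return mean_vector(vectors) if vectors else None


def user_profile(liked_texts):
    """Map each topic to the user's interest score, based only on liked texts."""
    item_vectors = [v for v in map(embed_text, liked_texts) if v is not None]
    if not item_vectors:
        return {topic: 0.0 for topic in TOPICS}
    user_vector = mean_vector(item_vectors)
    return {topic: cosine(user_vector, c) for topic, c in TOPICS.items()}


profile = user_profile(["great goal in the match", "late goal"])
```

Because a new item can be scored against such a profile from its text alone, a sketch of this kind needs no likes for the item being recommended, which is the cold start property mentioned above.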
