RedDust: a Large Reusable Dataset of Reddit User Traits

Social media is a rich source of assertions about personal traits, such as “I am a doctor” or “my hobby is playing tennis”. Identifying explicit assertions precisely is difficult, however, because users’ vocabulary and phrasing vary widely. Identifying personal traits from implicit assertions such as “I’ve been at work treating patients all day” is even more challenging. This paper presents RedDust, a large-scale annotated resource for user profiling that covers over 300k Reddit users across five attributes: profession, hobby, family status, age, and gender. We construct RedDust using a diverse set of high-precision patterns and demonstrate its use as a resource for developing learning models that handle implicit assertions. RedDust consists of users’ personal traits, represented as (attribute, value) pairs, along with the ids of the users’ posts, which can be used to retrieve the posts from a publicly available crawl or from the Reddit API. We discuss the construction of the resource and report statistics and insights into the data. We also compare several classifiers that can be trained on RedDust. To the best of our knowledge, RedDust is the first large-scale annotated language resource about Reddit users. We envision further use cases of RedDust in providing background knowledge about user traits to enhance personalized search, recommendation, and conversational agents.
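To make the idea of pattern-based labeling concrete, the following is a minimal sketch of how explicit assertions could be turned into (attribute, value) pairs keyed by post id. It is an illustration only: the regular expressions, the small profession lexicon, and the names `Trait` and `extract_traits` are assumptions made for this example and are not the patterns or code used to build RedDust.

```python
import re
from typing import List, NamedTuple


class Trait(NamedTuple):
    """One labeled trait: the post it came from plus an (attribute, value) pair."""
    post_id: str
    attribute: str
    value: str


# Hypothetical lexicon: restricting matches to known values keeps precision high
# ("I am a bit tired" must not yield profession=bit).
PROFESSIONS = {"doctor", "nurse", "teacher"}

# Hypothetical explicit-assertion patterns, modeled on the examples in the abstract.
PROFESSION_RE = re.compile(r"\bi(?: am|['’]m) an? (\w+)", re.IGNORECASE)
HOBBY_RE = re.compile(r"\bmy hobby is (?:playing )?(\w+)", re.IGNORECASE)


def extract_traits(post_id: str, text: str) -> List[Trait]:
    """Return the traits explicitly asserted in a single post."""
    traits = []
    for match in PROFESSION_RE.finditer(text):
        value = match.group(1).lower()
        if value in PROFESSIONS:
            traits.append(Trait(post_id, "profession", value))
    for match in HOBBY_RE.finditer(text):
        traits.append(Trait(post_id, "hobby", match.group(1).lower()))
    return traits


if __name__ == "__main__":
    print(extract_traits("p1", "I am a doctor and my hobby is playing tennis"))
    print(extract_traits("p2", "I've been at work treating patients all day"))
```

The second call returns an empty list: the implicit assertion contains no explicit pattern, which is the kind of case that a classifier trained on the pattern-derived labels is meant to cover.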
