Age Prediction of Spanish-speaking Twitter Users

Age prediction on Twitter is an interesting but challenging task that arises as a way to improve online marketing and potentially help detect cyber-pedophiles who pose as younger users through fake profiles. In this work, we focus on Twitter users who write in Spanish. Like any author profiling task, age prediction depends greatly on the language used by the target group. In the case of Spanish, one of the biggest difficulties is the lack of a labeled corpus. Hence, we explore strategies to generate one and, as a result, develop TweetLab, a software pipeline to extract and label Twitter data, which we customize for Spanish-speaking users from Uruguay and part of Argentina. Another problem is the short length of tweets, which makes it necessary to gather as much information as possible from them, even by inferring hidden relations or calculating lexical metrics. To that end, we study three types of features: user metadata, stylometric features from tweet text, and Natural Language Processing features extracted from tweets as well as from subscription lists, which contain information about the user's interests. We also present a novel set of features that model the presence of other social network profiles linked to the Twitter account. The extracted features are used to build models that serve as input to Machine Learning algorithms, which predict the age of the users and classify them into the defined age groups. We run several experiments with different datasets and algorithms. The experimental results show that these features work well for detecting users' age.
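
As an illustration of the kind of lexical metrics mentioned above, the following minimal Python sketch computes a few stylometric features over one user's tweets. It is not the TweetLab implementation: the function name `stylometric_features`, the chosen metrics, and the feature names are assumptions made for this example only.

```python
import re
from statistics import mean

# Hypothetical tokenizer: words, plus #hashtags and @mentions kept as single tokens.
TOKEN_RE = re.compile(r"[#@]?\w+", re.UNICODE)
URL_RE = re.compile(r"https?://\S+")


def stylometric_features(tweets):
    """Compute a few illustrative lexical metrics over a list of tweet strings."""
    # Drop URLs so their fragments are not counted as words.
    cleaned = [URL_RE.sub("", tweet) for tweet in tweets]
    tokens = [tok.lower() for tweet in cleaned for tok in TOKEN_RE.findall(tweet)]
    # Plain words exclude hashtags and mentions.
    words = [tok for tok in tokens if not tok.startswith(("#", "@"))]
    n_tweets = len(tweets)

    def per_tweet(count):
        return count / n_tweets if n_tweets else 0.0

    return {
        # Mean number of characters per word.
        "avg_word_length": mean(len(w) for w in words) if words else 0.0,
        # Vocabulary richness: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "hashtags_per_tweet": per_tweet(sum(t.startswith("#") for t in tokens)),
        "mentions_per_tweet": per_tweet(sum(t.startswith("@") for t in tokens)),
        # Mean number of characters per original tweet.
        "avg_tweet_length": mean(len(t) for t in tweets) if tweets else 0.0,
    }


if __name__ == "__main__":
    example = ["Hola @ana #futbol", "hola hola mundo"]
    for name, value in stylometric_features(example).items():
        print(f"{name}: {value}")
```

A dictionary of this shape could then be combined with user metadata and other features into a single vector per user before training a classifier over the age groups.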
