Building Topic Models to Predict Author Attributes from Twitter Messages

We use the topic modeling software package MALLET [10] to construct models of 100 topics each for the four languages in the scope of the PAN’15 Author Profiling task: English, Spanish, Italian, and Dutch. The topics in these models are essentially groups of words that may be semantically related and are frequently observed near each other in a collection of training documents. To ensure that we had a sufficiently large body of examples to build such models, we collected our own corpora of Twitter messages in each of the four languages. We also use MALLET to infer, for any given tweet, the most likely distribution over the generated topics that could have produced it, which allows us to represent each tweet as a concise 100-element document-topic distribution vector. These vectors serve as inputs to a set of classifiers that predict unknown authors’ age, gender, extroversion, stability, agreeableness, conscientiousness, and openness.
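To make the idea of a document-topic distribution vector concrete, the following is a minimal sketch in Python. It is not MALLET's inferencer, which is sampling-based; it swaps in a simple EM "fold-in" that holds the topic-word probabilities fixed and estimates only the topic mixture of one tweet. The function name `infer_topic_vector`, the smoothing value `alpha`, and the tiny two-topic table `TOPIC_WORD` are all hypothetical and chosen for illustration; the models described above have 100 topics, so the resulting vector would have 100 elements.

```python
from collections import Counter


def infer_topic_vector(tokens, topic_word, iterations=50, alpha=0.1):
    """Estimate a topic mixture for one tweet while keeping topics fixed.

    tokens:     list of word strings from a single tweet.
    topic_word: list of dicts, one per topic, mapping word -> P(word | topic).
                (Hypothetical input; a trained model would supply these.)
    alpha:      small pseudo-count per topic (assumed value) that keeps
                every topic's weight strictly positive.
    Returns a list of len(topic_word) probabilities that sums to 1.
    """
    k = len(topic_word)
    # Keep only words that at least one topic knows about.
    counts = Counter(t for t in tokens if any(t in tw for tw in topic_word))
    theta = [1.0 / k] * k
    if not counts:
        return theta  # no known words: fall back to a uniform mixture

    for _ in range(iterations):
        expected = [alpha] * k
        for word, n in counts.items():
            # E-step: share of this word attributed to each topic.
            weights = [theta[z] * topic_word[z].get(word, 0.0) for z in range(k)]
            norm = sum(weights)
            if norm == 0.0:
                continue
            for z in range(k):
                expected[z] += n * weights[z] / norm
        # M-step: renormalise expected counts into a distribution.
        total = sum(expected)
        theta = [e / total for e in expected]
    return theta


# Hypothetical two-topic model, for illustration only.
TOPIC_WORD = [
    {"goal": 0.5, "match": 0.3, "team": 0.2},
    {"vote": 0.5, "election": 0.3, "team": 0.2},
]

vec = infer_topic_vector("great goal in the match today".split(), TOPIC_WORD)
print(vec)
```

A vector like `vec` is what a downstream classifier would receive as its feature representation of the tweet, one such vector per message.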

[1] José Palazzo Moreira de Oliveira, et al. Examining Multiple Features for Author Profiling, 2014, J. Inf. Data Manag.

[2] Benno Stein, et al. Overview of the Author Profiling Task at PAN 2013, 2013, CLEF.

[3] Juan José Rodríguez Diez, et al. Rotation Forest: A New Classifier Ensemble Method, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4] Margaret L. Kern, et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach, 2013, PloS one.

[5] Ian H. Witten, et al. The WEKA data mining software: an update, 2009, SKDD.

[6] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[7] Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.

[8] José Palazzo Moreira de Oliveira, et al. Exploring Information Retrieval Features for Author Profiling, 2014, CLEF.

[9] Benno Stein, et al. Ousting ivory tower research: towards a web framework for providing experiments as a service, 2012, SIGIR '12.

[10] Ewan Klein, et al. Natural Language Processing with Python, 2009.

[11] Hugo Jair Escalante, et al. Using Intra-Profile Information for Author Profiling, 2014, CLEF.

[12] David Bamman, et al. Gender identity and lexical variation in social media, 2012, arXiv:1210.4567.

[13] Brendan T. O'Connor, et al. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters, 2013, NAACL.

[14] Petr Sojka, et al. Software Framework for Topic Modelling with Large Corpora, 2010.

[15] Benno Stein, et al. Overview of the 3rd Author Profiling Task at PAN 2015, 2015, CLEF.

[16] Wessel Kraaij, et al. Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014, 2014, CLEF.

[17] Thomas L. Griffiths, et al. Probabilistic Topic Models, 2007.

[18] Eric Gilbert, et al. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text, 2014, ICWSM.

[19] Christian Wolff, et al. TWORPUS - An Easy-to-Use Tool for the Creation of Tailored Twitter Corpora, 2013, GSCL.