Multilingual author profiling using word embedding averages and SVMs

This paper describes an experiment investigating author profiling of tweets in English and Spanish, with a focus on cross-genre evaluation. Profiling consists of age and gender classification. The training sets consist of tweets, while the evaluation genres are blogs, hotel reviews, tweets collected at a different time, and other social media. A TF-IDF baseline is compared with averaged word vectors, both using a Support Vector Machine classifier. Results show that averaged word vectors outperform TF-IDF in most cross-genre problems for age and gender.
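
The feature-construction step named in the abstract, representing a document by the average of its word vectors, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `average_word_vectors`, the toy 2-dimensional embeddings, and the choice to skip out-of-vocabulary tokens are all hypothetical. In a real setup the vectors would come from a trained embedding model, and the resulting document vectors would be passed to an SVM classifier.

```python
from typing import Dict, List, Sequence


def average_word_vectors(tokens: Sequence[str],
                         embeddings: Dict[str, List[float]],
                         dim: int) -> List[float]:
    """Return the element-wise mean of the vectors of all in-vocabulary tokens.

    Tokens missing from `embeddings` are skipped (an assumption of this sketch).
    If no token is known, an all-zero vector of length `dim` is returned.
    """
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # out-of-vocabulary token: ignore it
        for i, value in enumerate(vector):
            total[i] += value
        known += 1
    if known == 0:
        return total
    return [value / known for value in total]


# Hypothetical toy embeddings; real ones would have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "good": [1.0, 0.0],
    "hotel": [0.0, 1.0],
    "great": [1.0, 1.0],
}

# One fixed-length feature vector per document, regardless of document length.
doc_vector = average_word_vectors("good hotel unknownword".split(),
                                  TOY_EMBEDDINGS, dim=2)
print(doc_vector)
```

Because every document maps to a vector of the same length, texts from different genres (tweets, blogs, reviews) end up in the same feature space. A TF-IDF baseline instead builds one dimension per vocabulary term seen in training.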
