Early author profiling on Twitter using profile features with multi-resolution

Abstract The Author Profiling (AP) task aims to predict demographic characteristics of authors (e.g., age, gender, native language) from their documents. Research so far has focused on forensic scenarios, performing post-analysis with all the available text evidence. This paper introduces the task of Early Author Profiling (EAP) on Twitter. The goal is to recognize profiles effectively using as few tweets as possible from the user's history. The task is highly relevant to social media analysis and to security and marketing problems, where prevention and anticipation are crucial. This work proposes a novel strategy that combines a state-of-the-art representation for early text classification with specialized word vectors for author profiling tasks. In this strategy we build prototypical features called Profile based Meta-Words, which allow us to model AP information at different levels of granularity. Our evaluation shows that the proposed methodology is well suited for profiling with little text evidence (e.g., a handful of tweets) in early stages, whereas other granularities better encode larger amounts of text in late stages, as more tweets become available. We evaluated the proposed ideas on gender and language variety identification for English and Spanish, and showed that the proposal outperforms state-of-the-art methodologies.
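The early-profiling setting described above can be illustrated by scoring a classifier on growing prefixes of each user's tweet history. The following is a minimal sketch of such an evaluation loop, not the paper's method: the function `early_accuracy`, the stand-in classifier `toy_predict`, the labels `class_a`/`class_b`, and the toy tweets are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence


def early_accuracy(
    histories: Sequence[List[str]],
    labels: Sequence[str],
    predict: Callable[[List[str]], str],
    budgets: Sequence[int],
) -> Dict[int, float]:
    """Accuracy of `predict` when it sees only the first k tweets per user."""
    scores: Dict[int, float] = {}
    for k in budgets:
        # Truncate every history to its first k tweets before predicting.
        hits = sum(predict(h[:k]) == y for h, y in zip(histories, labels))
        scores[k] = hits / len(labels)
    return scores


def toy_predict(tweets: List[str]) -> str:
    """Hypothetical stand-in classifier: a keyword count, not a real profiler."""
    text = " ".join(tweets).lower()
    return "class_a" if text.count("football") >= text.count("recipe") else "class_b"


# Hypothetical toy data: two users with three tweets each.
histories = [
    ["hello world", "football tonight", "football again"],
    ["football?", "new recipe", "recipe blog"],
]
labels = ["class_a", "class_b"]

# Accuracy with 1, 2, and 3 tweets of evidence per user.
curve = early_accuracy(histories, labels, toy_predict, budgets=[1, 2, 3])
print(curve)
```

In this toy run the second user is misclassified until the third tweet arrives, which is the kind of behaviour an early-profiling evaluation is meant to expose: accuracy is reported as a function of the amount of evidence seen rather than as a single number over the full history.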
