XRCE Personal Language Analytics Engine for Multilingual Author Profiling: Notebook for PAN at CLEF 2015

This technical notebook describes the methodology used, and the results achieved, for the PAN 2015 Author Profiling Challenge by the team from Xerox Research Centre Europe (XRCE). This year, personality traits are introduced alongside age and gender in a corpus of tweets in four languages: English, Spanish, Italian and Dutch. We describe a largely language-agnostic methodology for classification which uses language-specific linguistic processing to generate features. We also report on experiments in which we use machine translation (MT) to accommodate languages for which there is less training data. Native language results are successful, but socio-demographic signals in language seem to be lost under MT conditions.
