Simply the Best: Minimalist System Trumps Complex Models in Author Profiling

A simple linear SVM with word and character n-gram features and minimal parameter tuning can identify the gender and the language variety (for English, Spanish, Arabic and Portuguese) of Twitter users with very high accuracy. All our attempts at improving performance by adding more data, designing smarter features, or employing more complex architectures plainly fail. We also experiment with joint and multitask modelling, but find that both are clearly outperformed by single-task models. In the end, we submitted our simplest model to the PAN 2017 shared task on author profiling, where it obtained an average accuracy of 0.86 on the test set, with performance on sub-tasks ranging from 0.68 to 0.98. These were the best results achieved at the competition overall. To allow lay people to easily use and see the value of machine learning for author profiling, we also built a web application on top of our models.
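
To make the kind of system described above concrete, the following is a minimal sketch, assuming scikit-learn, of a linear SVM over combined word and character n-gram features. The n-gram ranges, the TF-IDF weighting, the `build_profiler` helper, the toy texts and the `gb`/`us` labels are all illustrative assumptions; the abstract does not specify the actual settings or data format used in the submitted system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


def build_profiler():
    """Build a linear SVM over word and character n-gram features.

    The n-gram ranges and TF-IDF settings are assumptions chosen for
    illustration, not the values used by the authors.
    """
    word_ngrams = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                  sublinear_tf=True)
    char_ngrams = TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                                  sublinear_tf=True)
    # Concatenate both feature spaces into one sparse matrix.
    features = FeatureUnion([("word", word_ngrams), ("char", char_ngrams)])
    return Pipeline([("features", features), ("svm", LinearSVC())])


# Invented toy documents (one string per user) with hypothetical
# language-variety labels; real input would be each user's tweets.
docs = [
    "what a lovely colour, my favourite neighbour agrees",
    "the theatre programme was brilliant, proper good",
    "that color is my favorite, my neighbor agrees",
    "the theater program was awesome, real good",
]
labels = ["gb", "gb", "us", "us"]

model = build_profiler()
model.fit(docs, labels)
predictions = model.predict(["my favourite colour is grey"])
print(predictions)
```

One single-task model of this shape would be trained per language and per sub-task (gender or language variety), in line with the finding that single-task models work best here.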
