Twitter Text and Image Gender Classification with a Logistic Regression N-Gram Model: Notebook for PAN at CLEF 2018

We present our participation in the PAN 2018 Author Profiling shared task, which involves classifying authors by gender for English, Arabic and Spanish. We participated in all sub-tasks and propose a system for classification with text, images and a combination of the two. Our final submitted system is a Logistic Regression classifier that uses word and character n-grams as textual features, together with a set of automatically derived image-based features: the presence, proportion and number of faces (to detect selfies), as well as the faces' emotions and gender. We also experimented with word embeddings, but these negatively affected our system's performance. Our cross-validated training results show slight improvements in performance for Arabic and Spanish when image-based features are added to text-based features. Our highest scores on the PAN 2018 test dataset are accuracies of 81.2% for English using only text-based features, 78.7% for Arabic using both text- and image-based features, and 80.3% for Spanish using only text-based features. Overall, we finished 6th in the global ranking, with an average accuracy of 79.6% for our combined text and image system.
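The text side of such a system can be illustrated with a minimal sketch in plain Python: word and character n-gram counts feeding a logistic regression trained by stochastic gradient descent. This is not the submitted system. The n-gram ranges, the feature prefixes `w:` and `c:`, the learning rate, the toy texts and the 0/1 labels are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams (assumed range: unigrams and bigrams)."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats["w:" + " ".join(tokens[i:i + n])] += 1
    return feats


def char_ngrams(text, n_min=3, n_max=5):
    """Count character n-grams (assumed range: 3 to 5 characters)."""
    s = text.lower()
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(s) - n + 1):
            feats["c:" + s[i:i + n]] += 1
    return feats


def featurize(text):
    """Merge word and character n-gram counts into one sparse feature map."""
    feats = word_ngrams(text)
    feats.update(char_ngrams(text))
    return feats


def predict_proba(weights, bias, feats):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(weights.get(k, 0.0) * v for k, v in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(texts, labels, epochs=50, lr=0.1):
    """Fit weights with plain SGD on the log loss (no regularization)."""
    weights, bias = {}, 0.0
    data = [featurize(t) for t in texts]
    for _ in range(epochs):
        for feats, y in zip(data, labels):
            err = predict_proba(weights, bias, feats) - y
            bias -= lr * err
            for k, v in feats.items():
                weights[k] = weights.get(k, 0.0) - lr * err * v
    return weights, bias


# Toy data: labels are arbitrary 0/1 placeholders, not real annotations.
TEXTS = [
    "rainy morning coffee first",
    "late train again today",
    "coffee and a rainy walk",
    "train delayed again tonight",
]
LABELS = [1, 0, 1, 0]

WEIGHTS, BIAS = train(TEXTS, LABELS)
```

A new text would be scored with `predict_proba(WEIGHTS, BIAS, featurize(text))`. In practice a library implementation with regularization and TF-IDF weighting would be the more usual choice; the hand-rolled version here only makes the feature construction and the decision function explicit.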
