Word Unigram Weighing for Author Profiling at PAN 2018: Notebook for PAN at CLEF 2018

We present our system for the author profiling task at PAN 2018, which addresses gender identification on Twitter. The submitted system uses word unigrams, character 1- to 5-grams, and emoji unigrams as features to train a logistic regression classifier. We explore the impact of three different word unigram weighing schemes on our system’s performance. Our submission achieved accuracies of 77.42% for English, 74.64% for Spanish, and 73.20% for Arabic tweets. It ranked 15th out of 23 competitors.
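
To make the three feature families concrete, the following is a minimal sketch of how they might be counted for a single tweet. It is not the submitted implementation: the whitespace tokenisation, the lowercasing, the feature-name prefixes, and the Unicode ranges used to detect emoji are all assumptions made for illustration, and the helper names `is_emoji` and `extract_features` are hypothetical.

```python
from collections import Counter


def is_emoji(ch: str) -> bool:
    """Rough emoji test over a few Unicode blocks (an assumption; the
    emoji inventory actually used by the system is not specified here)."""
    cp = ord(ch)
    return (0x1F300 <= cp <= 0x1FAFF) or (0x2600 <= cp <= 0x27BF)


def extract_features(tweet: str, max_n: int = 5) -> Counter:
    """Count word unigrams, character 1- to max_n-grams, and emoji unigrams."""
    feats = Counter()
    # Word unigrams: lowercased, whitespace-tokenised (assumed preprocessing).
    for tok in tweet.lower().split():
        feats["w:" + tok] += 1
    # Character n-grams for n = 1..max_n over the raw tweet text.
    for n in range(1, max_n + 1):
        for i in range(len(tweet) - n + 1):
            feats["c%d:%s" % (n, tweet[i:i + n])] += 1
    # Emoji unigrams: one feature per emoji code point.
    for ch in tweet:
        if is_emoji(ch):
            feats["e:" + ch] += 1
    return feats


features = extract_features("so good 😀")
print(features["w:good"], features["c1:o"], features["e:😀"])
```

Raw counts like these would then be weighted and passed to a logistic regression classifier; the word unigram weighing schemes themselves are not reproduced in this sketch.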