Embedding and Clustering for Cross-Genre Gender Prediction

For CLIN 2019, a shared task on binary gender prediction within and across genres in Dutch was issued. This paper reports the findings of team ‘Rob’s Angels’ for this shared task. We trained a range of linear SVM models to predict gender within each genre (Twitter, YouTube and news) and across genres. Our best models used Twitter word embeddings, combined with stopword removal and tokenization of the text. We also introduce a novel approach to classifying the news corpus: each large news instance is split into smaller parts, the parts are classified individually, and the text as a whole is assigned a label by majority voting. We finished eighth in the in-genre category with an average accuracy of 0.617 and fourth in the cross-genre category with an average accuracy of 0.547.
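
The split-and-vote idea for long news texts can be illustrated with a minimal sketch. This is not the system's actual code. The function names, the chunk size, the tie-breaking rule and the toy classifier are all assumptions made for illustration; in practice the per-chunk classifier would be the trained model.

```python
from collections import Counter
from typing import Callable, List


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def predict_by_majority(
    text: str,
    classify_chunk: Callable[[str], str],
    chunk_size: int = 50,  # assumed value, not taken from the paper
) -> str:
    """Classify each chunk separately and return the most frequent label."""
    tokens = text.split()  # whitespace tokenization, for illustration only
    # Fall back to a single empty chunk so that empty input still gets a label.
    chunks = split_into_chunks(tokens, chunk_size) or [[]]
    votes = Counter(classify_chunk(" ".join(chunk)) for chunk in chunks)
    # Assumed tie-breaking: the label that was seen first wins.
    return votes.most_common(1)[0][0]


def toy_classifier(chunk_text: str) -> str:
    """Hypothetical stand-in for a trained model: keys on a dummy marker token."""
    return "F" if "MARK" in chunk_text.split() else "M"


if __name__ == "__main__":
    document = "MARK a b c MARK d"
    print(predict_by_majority(document, toy_classifier, chunk_size=2))
```

One consequence of this design is that a single misclassified chunk cannot flip the document label on its own, although the number of chunks and the handling of ties remain choices that would need tuning.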