
Embedding and Clustering for Cross-Genre Gender Prediction

For CLIN 2019, a shared task on binary gender prediction within and across different genres in Dutch was issued. This paper reports the findings of team 'Rob's Angels' for this shared task. A multitude of linear SVM models were created to predict gender within different genres (Twitter, YouTube and news) and across genres. Our best models used Twitter word embeddings in combination with stopword removal and tokenization of the text. We also introduced a novelty in classifying the news corpus: the large news instances are split into smaller parts, each part is classified individually, and the text as a whole is then assigned a label by majority voting. We finished eighth in the in-genre category with an average accuracy of 0.617 and fourth in the cross-genre category with an average accuracy of 0.547.
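The split-and-vote step for long news texts can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the chunk size of 100 tokens, whitespace tokenization, and the tie-breaking rule are all assumptions made for illustration, and `classify_chunk` stands in for whatever trained model (for example a linear SVM over embedding features) assigns a label to a single chunk.

```python
from collections import Counter
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int = 100) -> List[str]:
    """Split a document into consecutive chunks of at most `chunk_size` tokens.

    Whitespace tokenization and the default chunk size are assumptions;
    the abstract does not state how the news texts were split.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def predict_by_majority(
    text: str,
    classify_chunk: Callable[[str], str],
    chunk_size: int = 100,
) -> str:
    """Label a whole document with the most frequent chunk-level label.

    `classify_chunk` is a hypothetical callable wrapping a trained
    per-chunk classifier.
    """
    chunks = split_into_chunks(text, chunk_size)
    if not chunks:
        raise ValueError("cannot classify an empty document")
    votes = Counter(classify_chunk(chunk) for chunk in chunks)
    # On a tie, Counter.most_common keeps the label that was seen first;
    # this tie-breaking rule is an assumption, not taken from the paper.
    return votes.most_common(1)[0][0]
```

Any callable that maps a chunk of text to a label can be passed as `classify_chunk`, so the voting logic stays independent of the underlying model.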

Kelly Dekker | Rianne Bos | Harm-Jan Setz
