Author Gender Prediction in Russian Social Media Texts

Natural language processing for social media, particularly sentiment analysis and topic modeling, is currently gaining momentum for Russian texts. However, Slavic languages, including Russian, remain insufficiently explored in computational sociolinguistics and authorship profiling (i.e., the automatic identification of latent demographic features of online users, such as gender, age, and personality, from their texts). Being able to predict these features with high accuracy would benefit marketing, psychological studies, and security. In this paper we build classifiers to predict the gender of the author in Russian Twitter and Facebook texts and explore the effect of cross-genre evaluation. As features we used the most common lemmas, a set of morphological and syntactic parameters, and part-of-speech (POS) trigrams, and we trained and tested models with multiple classifiers. The Twitter corpus was used for training; the Facebook corpus and the test set of the Twitter corpus were used for testing. The best models on Twitter were ExtraTreesClassifier and RandomForestClassifier (accuracy 0.72), and the best model on Facebook was linearSVM (0.71). These results are comparable with state-of-the-art results for Russian texts of different genres.
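As an illustration of one of the feature types mentioned above, the following is a minimal sketch of how POS trigram features might be computed. It assumes each text has already been POS-tagged; the tag names, the function names (`pos_trigrams`, `build_vocabulary`, `vectorize`), the `top_k` cutoff, and the use of relative frequencies are all illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def pos_trigrams(tags):
    """Return all consecutive POS-tag triples in one tagged text."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]


def build_vocabulary(tagged_texts, top_k):
    """Pick the top_k most frequent POS trigrams over the training texts.

    top_k is a hypothetical cutoff; the paper does not state one here.
    """
    counts = Counter()
    for tags in tagged_texts:
        counts.update(pos_trigrams(tags))
    return [trigram for trigram, _ in counts.most_common(top_k)]


def vectorize(tags, vocabulary):
    """Map one tagged text to relative trigram frequencies over the vocabulary."""
    counts = Counter(pos_trigrams(tags))
    total = sum(counts.values())
    if total == 0:
        # Texts shorter than three tokens contain no trigrams.
        return [0.0] * len(vocabulary)
    return [counts[trigram] / total for trigram in vocabulary]


# Toy, hand-tagged examples (hypothetical tags, not real corpus data).
train_tags = [
    ["PRON", "VERB", "NOUN", "ADJ"],
    ["PRON", "VERB", "NOUN"],
]
vocabulary = build_vocabulary(train_tags, top_k=2)
features = [vectorize(tags, vocabulary) for tags in train_tags]
```

In a cross-genre setup like the one described, the vocabulary would be built on the training genre only and then reused unchanged to vectorize texts from the other genre, so that both feature matrices share the same columns before being passed to a classifier.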
