Gender Prediction for Authors of Russian Texts Using Regression and Classification Techniques

Automatic extraction of information about the authors of texts (gender, age, psychological type, etc.) from an analysis of linguistic parameters has gained particular significance, as a growing number of online texts come from authors who either withhold personal data or provide intentionally deceptive data, even though such information is of practical importance in marketing, forensics, and sociology. Such studies have been carried out over the last 10 years, mainly for English. This paper presents the results of a study of RusPersonality, a corpus of Russian-language texts, that addressed automatic identification of the gender of the author of a Russian text using mostly topic-independent text parameters. Gender identification was treated both as a classification task and as a regression task. For the first time for Russian texts, we have obtained models that classify authors by gender with accuracy equal to the state of the art.
