Detecting Gender by Full Name: Experiments with the Russian Language

This paper describes a method that detects the gender of a person from his or her full name. While some approaches have been proposed for English, little has been done so far for Russian. We fill this gap and present a large-scale experiment on a dataset of 100,000 Russian full names from Facebook. Our method is based on three types of features (word endings, character \(n\)-grams, and a dictionary of names) combined within a linear supervised model. Experiments show that this simple and computationally efficient approach achieves an accuracy of up to 96%.
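To make the feature design concrete, the following is a minimal sketch of how the three feature types could be extracted and fed to a linear model. It is not the paper's implementation. The abstract only says "linear supervised model", so logistic regression trained by plain stochastic gradient descent is used here as a stand-in. The feature templates (endings of length 1 to 3, character trigrams with `^`/`$` boundary markers), the tiny name dictionary `NAME_DICT`, the toy training set `TRAIN`, and all function names are hypothetical and chosen for illustration only.

```python
import math
from collections import defaultdict

# Hypothetical mini-dictionary of first names with known gender (assumption).
NAME_DICT = {"анна": "f", "мария": "f", "иван": "m", "сергей": "m"}

# Hypothetical toy training set; the paper uses 100,000 names from Facebook.
TRAIN = [
    ("Анна Иванова", "f"), ("Мария Петрова", "f"), ("Ольга Смирнова", "f"),
    ("Иван Петров", "m"), ("Сергей Смирнов", "m"), ("Дмитрий Иванов", "m"),
]

def extract_features(full_name, n=3, max_ending=3):
    """Map a full name to a sparse dict of binary features."""
    feats = {}
    for tok in full_name.lower().split():
        # 1) Word endings of length 1..max_ending.
        for k in range(1, min(max_ending, len(tok)) + 1):
            feats["end=" + tok[-k:]] = 1.0
        # 2) Character n-grams over the token padded with boundary markers.
        padded = "^" + tok + "$"
        for i in range(len(padded) - n + 1):
            feats["ng=" + padded[i:i + n]] = 1.0
        # 3) Dictionary lookup: fires only for names listed in NAME_DICT.
        if tok in NAME_DICT:
            feats["dict=" + NAME_DICT[tok]] = 1.0
    return feats

def predict_proba(weights, feats):
    """Probability that the name is female under a logistic linear model."""
    z = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=50, lr=0.5):
    """Fit feature weights with plain SGD on the logistic loss."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for name, label in examples:
            feats = extract_features(name)
            y = 1.0 if label == "f" else 0.0
            err = y - predict_proba(weights, feats)
            for f, v in feats.items():
                weights[f] += lr * err * v
    return weights

model = train(TRAIN)
p_female = predict_proba(model, extract_features("Елена Кузнецова"))
p_male_name = predict_proba(model, extract_features("Алексей Кузнецов"))
```

In this sketch, unseen names are classified through shared endings and n-grams (for example, the surname ending "-ова" versus "-ов"), which is why such surface features can generalize beyond the name dictionary. A real experiment would use a much larger dictionary, a held-out test set, and a regularized solver.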
