What’s in a name? – gender classification of names with character based machine learning models

Gender information is no longer a mandatory input when registering for an account at many leading Internet companies. However, predicting demographic information such as gender and age remains an important task, especially for mitigating unintentional gender/age bias in recommender systems. It is therefore necessary to infer the gender of users who did not provide this information during registration. We consider the problem of predicting the gender of registered users based on their declared name. By analyzing the first names of 100M+ users, we found that gender can be classified very effectively using the character composition of the name strings. We propose a number of character-based machine learning models and demonstrate that they infer the gender of users with much higher accuracy than baseline models. Moreover, we show that using last names in addition to first names further improves classification performance.
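To make the idea of classifying names by their character composition concrete, the following is a minimal sketch of one possible character-based classifier: character n-gram features fed to a multinomial Naive Bayes model with add-one smoothing. It is not one of the models proposed in the paper. The function `char_ngrams`, the class `CharNgramNaiveBayes`, the boundary markers `^` and `$`, the n-gram range, and the eight toy names with their labels are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(name, n_min=1, n_max=3):
    """Return character n-grams of a lower-cased name.

    '^' and '$' mark the start and end of the name, so prefixes and
    suffixes become distinct features (an illustrative choice).
    """
    padded = "^" + name.strip().lower() + "$"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (hypothetical helper)."""

    def fit(self, names, labels):
        self.counts = defaultdict(Counter)    # label -> n-gram counts
        self.class_totals = Counter(labels)   # label -> number of names
        for name, label in zip(names, labels):
            self.counts[label].update(char_ngrams(name))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def log_scores(self, name):
        """Return the unnormalized log posterior of each label for a name."""
        n_examples = sum(self.class_totals.values())
        scores = {}
        for label, n_label in self.class_totals.items():
            total = sum(self.counts[label].values())
            score = math.log(n_label / n_examples)  # class prior
            for gram in char_ngrams(name):
                # Add-one smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((self.counts[label][gram] + 1)
                                  / (total + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


# Toy training data, invented for this sketch only.
toy_names = ["anna", "maria", "julia", "sophia", "mark", "john", "peter", "robert"]
toy_labels = ["F", "F", "F", "F", "M", "M", "M", "M"]

model = CharNgramNaiveBayes().fit(toy_names, toy_labels)
prediction = model.predict("Anna")
```

A model like this only illustrates why suffixes and other substrings of a name can carry a gender signal. Any realistic use would need a large labeled name set and a held-out evaluation, and the neural character-level models the abstract alludes to would replace the hand-built n-gram counts with learned representations.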
