A search engine approach to estimating temporal changes in gender orientation of first names

This paper presents an approach for predicting the gender orientation of any given first name over time based on a set of search engine queries with the name prefixed by masculine and feminine markers (e.g., "Uncle Taylor"). We hypothesize that these markers can capture the great majority of variability in gender orientation, including temporal changes. To test this hypothesis, we train a logistic regression model, with time-varying marker weights, using marker counts from Bing.com to predict male/female counts for 85,406 names in US Social Security Administration (SSA) data during 1880-2008. The model misclassifies 2.25% of the people in the SSA dataset (slightly worse than the 1.74% pure error rate) and provides accurate predictions for names beyond the SSA. The misclassification rate is higher in recent years (due to increasing name diversity), for general English words (e.g., Will), for names from certain countries (e.g., China), and for rare names. However, the model tends to err on the side of caution by predicting neutral/unknown rather than a false positive. As hypothesized, the markers also capture temporal patterns of androgyny. For example, Daughter is a stronger female predictor for recent years while Grandfather is a stronger male predictor around the turn of the 20th century. The model has been implemented as a web-tool called Genni (available via http://abel.lis.illinois.edu/) that displays the predicted proportion of females vs. males over time for any given name. This should be a valuable resource for those who utilize names in order to discern gender on a large scale, e.g., to study the roles of gender and diversity in scholarly work based on digital libraries and bibliographic databases where the authors' names are listed.
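
As an illustration of the general idea only, the following minimal sketch shows how a logistic model with time-varying marker weights could turn marker counts into a predicted proportion of females for a given year. Everything specific in it is an assumption made for this example: the marker list, the weight values, the two anchor years, the linear interpolation between them, the log-scaling of counts, and the function names `interpolate_weights`, `prob_female`, and `label`. None of these are taken from the paper or from Genni.

```python
import math

# Hypothetical marker weights at two anchor years (positive = female-leaning).
# The numbers are invented for illustration; they are NOT the fitted weights.
WEIGHTS = {
    1900: {"uncle": -1.0, "aunt": 1.0, "grandfather": -1.5, "daughter": 0.5},
    2000: {"uncle": -1.0, "aunt": 1.0, "grandfather": -0.5, "daughter": 1.5},
}


def interpolate_weights(year):
    """Linearly interpolate marker weights between the two anchor years.

    Years outside [1900, 2000] are clamped to the nearest anchor.
    This interpolation scheme is an assumption of the sketch.
    """
    y0, y1 = 1900, 2000
    t = min(max((year - y0) / (y1 - y0), 0.0), 1.0)
    return {m: (1 - t) * WEIGHTS[y0][m] + t * WEIGHTS[y1][m] for m in WEIGHTS[y0]}


def prob_female(marker_counts, year, bias=0.0):
    """Logistic prediction of P(female) from search-hit counts per marker.

    marker_counts maps a marker (e.g. "aunt") to the hit count for the
    query "<marker> <name>". Counts are log-scaled with log1p so that
    very frequent names do not dominate (an assumption of this sketch).
    """
    w = interpolate_weights(year)
    z = bias + sum(w[m] * math.log1p(marker_counts.get(m, 0)) for m in w)
    return 1.0 / (1.0 + math.exp(-z))


def label(p, margin=0.25):
    """Map a probability to 'female', 'male', or 'unknown' near 0.5."""
    if p >= 0.5 + margin:
        return "female"
    if p <= 0.5 - margin:
        return "male"
    return "unknown"


if __name__ == "__main__":
    # Made-up counts for a name seen equally often with both markers.
    counts = {"grandfather": 10, "daughter": 10}
    for year in (1900, 1950, 2000):
        p = prob_female(counts, year)
        print(year, round(p, 3), label(p))
```

With these invented weights, the same counts lean male in 1900 and female in 2000, mirroring the kind of temporal shift described above for Grandfather and Daughter. The `unknown` band around 0.5 reflects the stated preference for a neutral prediction over a false positive.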
