Cultural Differences in Bias? Origin and Gender Bias in Pre-Trained German and French Word Embeddings

Smart applications often rely on training data in the form of text. If that training data is biased, the decisions of these applications might not be fair. Common training data has been shown to contain bias against different minority groups. However, there is no generic algorithm to determine the fairness of training data. One existing approach is to measure gender bias using word embeddings. Most research in this field has been dedicated to the English language. In this work, we show that both German and French word embeddings contain gender and origin bias. In particular, we find that real-world biases and stereotypes from the 18th century are still present in today’s word embeddings. Furthermore, we show that gender bias in German takes a different form than in English, and there are indications that bias differs across cultures, which needs to be considered when analyzing texts and word embeddings in different languages.
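To make the idea of measuring bias in word embeddings concrete, the following is a minimal sketch of a WEAT-style effect size in the spirit of Caliskan et al.'s Word Embedding Association Test. It is an illustration only: the abstract does not state which measure this work uses, the function names are hypothetical, the two-dimensional vectors are invented toy data rather than real German or French embeddings, and the use of the sample standard deviation is an assumption.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """How much closer w is to attribute set A than to attribute set B."""
    sim_a = mean(cosine(w, a) for a in attrs_a)
    sim_b = mean(cosine(w, b) for b in attrs_b)
    return sim_a - sim_b


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations of X and Y, scaled by the
    sample standard deviation over all target words (an assumption)."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Invented 2-D toy vectors; a real analysis would load pre-trained embeddings.
career_words = [[1.0, 0.1], [0.9, 0.2]]      # target set X
family_words = [[0.1, 1.0], [0.2, 0.9]]      # target set Y
male_terms = [[1.0, 0.0], [0.95, 0.05]]      # attribute set A
female_terms = [[0.0, 1.0], [0.05, 0.95]]    # attribute set B

effect = weat_effect_size(career_words, family_words, male_terms, female_terms)
swapped = weat_effect_size(family_words, career_words, male_terms, female_terms)
print(f"effect size: {effect:.3f}, with target sets swapped: {swapped:.3f}")
```

A positive value indicates that the first target set lies closer to the first attribute set than the second target set does; swapping the target sets flips the sign. With real embeddings, the word lists would need to be chosen per language, which is where cultural and grammatical differences can enter.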
