AraWEAT: Multidimensional Analysis of Biases in Arabic Word Embeddings

Recent work has shown that distributional word vector spaces often encode human biases such as sexism or racism. In this work, we conduct an extensive analysis of biases in Arabic word embeddings by applying a range of recently introduced bias tests to a variety of embedding spaces induced from Arabic corpora. We measure the presence of biases across several dimensions, namely: embedding models (Skip-Gram, CBOW, and FastText) and vector sizes, types of text (encyclopedic text and news vs. user-generated content), dialects (Egyptian Arabic vs. Modern Standard Arabic), and time (diachronic analyses over corpora from different time periods). Our analysis yields several interesting findings, e.g., that implicit gender bias in embeddings trained on Arabic news corpora steadily increases over time (between 2007 and 2017). We make the Arabic bias specifications (AraWEAT) publicly available.
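The bias tests applied in the paper build on the Word Embedding Association Test (WEAT), which quantifies bias as the differential cosine-similarity association between two sets of target words (e.g., career vs. family terms) and two sets of attribute words (e.g., male vs. female terms). A minimal sketch of the WEAT effect size, using NumPy and toy vectors in place of trained Arabic embeddings (all variable names here are illustrative, not from the paper's released code):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # s(w, A, B): mean similarity of word vector w to attribute set A
    # minus its mean similarity to attribute set B.
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # Cohen's-d-style effect size of the differential association
    # between target sets X, Y and attribute sets A, B.
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Toy example: targets in X lean toward attribute set A, targets in Y toward B,
# so the effect size comes out positive (indicating bias in the tested direction).
A = [np.array([1.0, 0.0])]
B = [np.array([0.0, 1.0])]
X = [np.array([1.0, 0.1]), np.array([1.0, 0.2])]
Y = [np.array([0.1, 1.0]), np.array([0.2, 1.0])]
print(weat_effect_size(X, Y, A, B))
```

In the full test, the effect size is typically accompanied by a permutation test over re-partitions of X ∪ Y to assess significance; that step is omitted here for brevity.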
