Debiasing Gender-Biased Hindi Words with Word Embeddings

Word embedding is a major machine learning technique for computational language applications. Given a corpus, word embedding maps each word to a vector in a multi-dimensional space such that semantic similarities between related words are retained. While learning the similarities encapsulated in the training corpus, the embedding process inadvertently captures other features inherent in the corpus. One such feature is the bias arising from stereotypes, which is present in almost every corpus, no matter how extensively used and trusted. We study this aspect of word embeddings in the context of the Hindi language. We show that many gender-neutral Hindi words are mapped to vectors inclined towards one gender or the other in the embedding space. We propose a new debiasing algorithm and demonstrate its efficacy on Hindi. Further, we build an SVM-based classifier that determines whether a gender-neutral word's embedding is classified as neutral or otherwise. We corroborate our claims with experimental results on a large number of individual words. To the best of our knowledge, this is the first work on debiasing in the Hindi language, and our debiasing algorithm is applicable in the context of any language.
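The abstract above does not spell out the paper's algorithm, but the general idea of measuring and removing a gender component from a word vector can be sketched with the classic hard-debiasing recipe (in the style of Bolukbasi et al.): estimate a gender direction from a pair of anchor words, score a word by its cosine projection onto that direction, and neutralize it by subtracting that projection. The anchor vectors below are toy, hypothetical 3-d examples, not real Hindi embeddings, and this sketch is not the paper's own method.

```python
# Hedged sketch of gender-direction bias scoring and neutralization.
# All vectors here are toy examples; real usage would load trained
# Hindi embeddings and use gendered anchor word pairs from the corpus.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return dot(u, u) ** 0.5

def gender_direction(vec_male, vec_female):
    """Unit vector along the difference of two gendered anchor embeddings."""
    d = [m - f for m, f in zip(vec_male, vec_female)]
    n = norm(d)
    return [x / n for x in d]

def bias_score(vec, g):
    """Cosine of the word vector with the gender direction; ~0 means neutral."""
    return dot(vec, g) / norm(vec)

def neutralize(vec, g):
    """Remove the component of vec along g, leaving a vector orthogonal to g."""
    proj = dot(vec, g)
    return [v - proj * gi for v, gi in zip(vec, g)]

# Toy example: a "gender-neutral" word whose vector leans toward the male anchor.
g = gender_direction([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
w = [0.6, 0.8, 0.0]
print(round(bias_score(w, g), 2))        # nonzero before debiasing -> 0.6
w_debiased = neutralize(w, g)
print(round(bias_score(w_debiased, g), 2))  # zero after debiasing -> 0.0
```

The `bias_score` of each word, before and after neutralization, is exactly the kind of feature one could feed to an SVM classifier that labels a word's embedding as gender-neutral or biased, as the paper does.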
