Generating Gender Augmented Data for NLP

Gender bias is a frequent occurrence in NLP-based applications, and it is especially pronounced in gender-inflected languages. Bias can appear through associations of certain adjectives and animate nouns with the natural gender of referents, but also through unbalanced grammatical-gender frequencies of inflected words. This type of bias becomes more evident when generating conversational utterances in which gender is not specified within the sentence, because most current NLP applications still operate on sentence-level context. As a step towards more inclusive NLP, this paper proposes an automatic and generalisable rewriting approach for short conversational sentences. The rewriting method applies to sentences that, without extra-sentential context, have multiple gender-equivalent alternatives. The method can be used both to create gender-balanced outputs and to generate gender-balanced training data. The proposed approach is based on a neural machine translation system trained to ‘translate’ from one gender alternative to another. Both automatic and manual analyses of the approach show promising results for the automatic generation of gender alternatives for conversational sentences in Spanish.
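To make the rewriting task concrete, the sketch below shows the kind of input/output behaviour the system is trained to produce. The paper's actual method is a neural machine translation model; this hand-written substitution table for a few Spanish adjectives is only a hypothetical, rule-based illustration and not the authors' implementation.

```python
# Toy illustration of the gender-rewriting task: map a sentence with one
# gender inflection to its alternative. The paper trains an NMT system to
# 'translate' between alternatives; this lookup table is a sketch only.

ALTERNATIVES = {
    "cansado": "cansada",  # tired  (masc. -> fem.)
    "listo": "lista",      # ready
    "solo": "sola",        # alone
}
# Add the reverse direction so the rewriter works fem. -> masc. as well.
ALTERNATIVES.update({fem: masc for masc, fem in list(ALTERNATIVES.items())})

def rewrite_gender(sentence: str) -> str:
    """Swap each gender-inflected token for its alternative form."""
    return " ".join(ALTERNATIVES.get(tok, tok) for tok in sentence.split())

print(rewrite_gender("estoy cansado"))  # -> estoy cansada
print(rewrite_gender("estoy cansada"))  # -> estoy cansado
```

A rule-based table like this breaks down quickly (agreement across determiners, nouns, and participles; ambiguous forms), which is precisely why the paper learns the rewriting from data instead.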
