Gender Representation in Open Source Speech Resources

With the rise of artificial intelligence (AI) and the growing use of deep-learning architectures, the ethics, transparency and fairness of AI systems have become a central concern within the research community. We address transparency and fairness in spoken language systems through a study of gender representation in speech resources available through the Open Speech and Language Resource platform. We show that finding gender information in open source corpora is not straightforward and that gender balance depends on other corpus characteristics (elicited/non-elicited speech, low-/high-resource language, targeted speech task). The paper ends with recommendations on metadata and gender information for researchers, in order to ensure better transparency of the speech systems built using such corpora.
