Gender and Dialect Bias in YouTube’s Automatic Captions

This project evaluates the accuracy of YouTube’s automatically generated captions across two genders and five dialect groups. Speakers’ dialect and gender were controlled for by using videos uploaded as part of the “accent tag challenge”, in which speakers explicitly identify their language background. The results show robust differences in accuracy across both gender and dialect, with lower accuracy for 1) women and 2) speakers from Scotland. This finding builds on earlier research showing that a speaker’s sociolinguistic identity may negatively affect their ability to use automatic speech recognition, and demonstrates the need for sociolinguistically stratified validation of systems.
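As an illustration of what stratified validation could look like in practice, the sketch below computes an error rate separately for each speaker group. It assumes word error rate (WER) as the accuracy measure, which the abstract does not specify; the function names `word_errors` and `wer_by_group`, the tuple layout of the input, and the group labels in the usage lines are hypothetical and carry no real results.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_length) at the word level.

    Assumption: accuracy is scored as word error rate, i.e. the
    Levenshtein distance between word sequences over reference length.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Pool errors per group; samples are (group, reference, hypothesis)."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    # Skip groups with no reference words to avoid division by zero.
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


# Toy usage with invented transcripts (not data from the study).
toy = [
    ("group_a", "the cat sat", "the cat sat"),
    ("group_b", "the cat sat", "the hat sat down"),
]
rates = wer_by_group(toy)
```

Pooling errors and word counts within each group before dividing weights longer recordings more heavily than averaging per-speaker rates would; either choice is defensible, and the one shown here is only one option.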
