AequeVox: Automated Fairness Testing of Speech Recognition Systems

Automatic Speech Recognition (ASR) systems have become ubiquitous. They can be found in a variety of form factors and are increasingly important in our daily lives. As such, ensuring that these systems are equitable to different subgroups of the population is crucial. In this paper, we introduce AequeVox, an automated testing framework for evaluating the fairness of ASR systems. AequeVox simulates different environments to assess the effectiveness of ASR systems for different populations. In addition, we investigate whether the chosen simulations are comprehensible to humans. We further propose a fault localization technique capable of identifying words that are not robust to these varying environments. Both components of AequeVox are able to operate in the absence of ground truth data. We evaluated AequeVox on speech from four different datasets using three different commercial ASRs. Our experiments reveal that non-native English, female and Nigerian English speakers generate, on average, 109%, 528.5% and 156.9% more errors than native English, male and UK Midlands speakers, respectively. Our user study also reveals that 82.9% of the simulations (employed through speech transformations) had a comprehensibility rating above seven (out of ten), with the lowest rating being 6.78. This further validates the fairness violations discovered by AequeVox. Finally, we show that the non-robust words, as predicted by the fault localization technique embodied in AequeVox, show 223.8% more errors than the predicted robust words across all ASRs.
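As a rough illustration of how errors could be compared across groups without ground truth transcripts, the sketch below measures how much an ASR transcript changes when the same utterance is passed through a simulated environment, and then expresses the gap between two groups as a percentage. This is only a minimal sketch under stated assumptions, not the procedure used by AequeVox: the choice of word-level edit distance as the error measure, the function names, and the sample transcripts are all hypothetical.

```python
from statistics import mean


def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # previous[j] = distance between the words of ref seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def mean_degradation(pairs):
    """Average distance over (original transcript, transformed transcript) pairs.

    Assumption: the transcript of the untransformed audio stands in for the
    missing ground truth, so no human-labelled text is required.
    """
    return mean(word_edit_distance(orig, transformed) for orig, transformed in pairs)


def percent_more_errors(group_a: float, group_b: float) -> float:
    """How many percent more errors group_a shows relative to group_b."""
    return 100.0 * (group_a - group_b) / group_b


# Hypothetical transcripts: (ASR output on original audio, ASR output on transformed audio)
group_a = [("turn on the light", "turn on a night"), ("call my mother", "call mother")]
group_b = [("turn on the light", "turn on the light"), ("call my mother", "call my brother")]

gap = percent_more_errors(mean_degradation(group_a), mean_degradation(group_b))
print(f"group_a shows {gap:.1f}% more errors than group_b")
```

Using the transcript of the original audio as the reference is what lets such a comparison run without labelled data; the cost is that any errors already present in that transcript go unnoticed.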
