CrossASR++: a modular differential testing framework for automatic speech recognition

Developers need to perform adequate testing to ensure the quality of Automatic Speech Recognition (ASR) systems. However, manually collecting the required test cases is tedious and time-consuming. Our recent work proposes CrossASR, a differential testing method for ASR systems. The method first utilizes Text-to-Speech (TTS) to automatically generate audios from texts and then feeds these audios into different ASR systems for cross-referencing to uncover failed test cases. It also leverages a failure estimator to find failing test cases more efficiently. Such a method is inherently self-improving: its performance increases as more advanced TTS and ASR systems are incorporated. In this accompanying tool demo paper, we further engineer CrossASR and propose CrossASR++, an easy-to-use ASR testing tool that can be conveniently extended to incorporate different TTS and ASR systems as well as failure estimators. We also make CrossASR++ chunk texts from a given corpus dynamically and enable the estimator to work in a more effective and flexible way. We demonstrate that the new features help CrossASR++ discover more failed test cases. Using the same TTS and ASR systems, CrossASR++ uncovers 26.2% more failed test cases for 4 ASRs than the original tool. Moreover, by simply adding one more ASR for cross-referencing, we increase the number of failed test cases uncovered for each of the 4 ASR systems by 25.07%, 39.63%, 20.95%, and 8.17%, respectively. We also extend CrossASR++ with 5 additional failure estimators. Compared to the worst estimator, the best one discovers 10.41% more failed test cases within the same amount of time. The demo video for CrossASR++ can be viewed at https://youtu.be/ddRk-f0QV-g and the source code can be found at https://github.com/soarsmu/CrossASRplus.
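
To make the cross-referencing idea concrete, below is a minimal Python sketch of the differential testing step described above. It is not CrossASR++'s actual API: the TTS/ASR callables, the CaseResult type, and the cross_reference function are hypothetical names introduced for illustration, and the real tool additionally uses a failure estimator to decide which corpus texts to process first.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interfaces: a TTS maps text to synthesized audio bytes,
# and an ASR maps audio bytes back to a transcription.
TTS = Callable[[str], bytes]
ASR = Callable[[bytes], str]

@dataclass
class CaseResult:
    text: str
    failed_asrs: List[str]   # ASRs whose transcription differs from the text
    determinable: bool       # True if at least one ASR transcribed the text correctly

def normalize(transcription: str) -> str:
    # Lowercase and collapse whitespace before comparing texts.
    return " ".join(transcription.lower().split())

def cross_reference(text: str, tts: TTS, asrs: Dict[str, ASR]) -> CaseResult:
    """Synthesize `text`, transcribe the audio with every ASR, and flag the
    ASRs whose output disagrees with the original text."""
    audio = tts(text)
    hypotheses = {name: normalize(asr(audio)) for name, asr in asrs.items()}
    reference = normalize(text)
    failed = [name for name, hyp in hypotheses.items() if hyp != reference]
    # Only trust failures when at least one ASR got the text right; if every
    # ASR fails, the synthesized audio itself may be unintelligible, so the
    # case is treated as indeterminable rather than as a failed test case.
    determinable = len(failed) < len(asrs)
    return CaseResult(text, failed if determinable else [], determinable)
```

In the actual tool, the failure estimator predicts which corpus texts are most likely to yield failed test cases, so TTS synthesis and ASR transcription effort is spent on those texts first.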
