CrossASR: Efficient Differential Testing of Automatic Speech Recognition via Text-To-Speech

Automatic speech recognition (ASR) systems are a ubiquitous part of modern life: they can be found in our smartphones, desktops, and smart home systems. To ensure their correctness in recognizing speech, ASR systems need to be tested. Testing an ASR requires test cases in the form of audio files and their transcribed texts. Building these test cases manually, however, is tedious and time-consuming.

To deal with this challenge, we propose CrossASR, an approach that capitalizes on existing Text-To-Speech (TTS) systems to automatically generate test cases for ASR systems. CrossASR is a differential testing solution that compares the outputs of multiple ASR systems to uncover erroneous behaviors among them. CrossASR generates test cases efficiently, uncovering failures with as few generated tests as possible; it does so by employing a failure probability predictor to pick the texts with the highest likelihood of leading to failed test cases. As a black-box approach, CrossASR can generate test cases for any ASR, even when the ASR model is not available (e.g., when evaluating the reliability of third-party ASR services).

We evaluated CrossASR using 4 TTSes and 4 ASRs on the Europarl corpus. The ASRs under test are DeepSpeech, DeepSpeech2, wav2letter, and Wit. Our experiments on 20,000 randomly sampled English texts showed that within an hour, CrossASR produced, on average over 3 experiments, 130.34, 123.33, 47.33, and 8.66 failed test cases using the Google, ResponsiveVoice, Festival, and Espeak TTSes, respectively. Moreover, when we ran CrossASR on the entire 20,000 texts, it generated 13,572, 13,071, 5,911, and 1,064 failed test cases using the Google, ResponsiveVoice, Festival, and Espeak TTSes, respectively. Based on a manual verification carried out on a statistically representative sample, we found that most samples are actual failed test cases (audio that is understandable to humans but cannot be transcribed properly by an ASR), demonstrating that CrossASR is highly reliable in determining failed test cases. The source code of CrossASR and our evaluation data are available at https://github.com/soarsmu/CrossASR.
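To make the cross-referencing oracle concrete, the following is a minimal Python sketch of how a differential test could label each ASR on one generated audio: a case counts as a failure for an ASR only when at least one other ASR transcribes the same audio correctly (i.e., the audio is intelligible). The `tts.synthesize` and `asr.transcribe` interfaces and the exact-match comparison are illustrative assumptions, not CrossASR's actual API.

```python
# A minimal sketch of a CrossASR-style differential test oracle.
# The TTS and ASR client interfaces (synthesize, transcribe) are
# hypothetical placeholders, not the actual CrossASR implementation.
import re

def normalize(text: str) -> str:
    """Lower-case and strip punctuation so transcriptions compare fairly."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def run_test_case(text, tts, asrs):
    """Synthesize `text` once, transcribe with every ASR, and label each ASR.

    An ASR fails on this case only if some other ASR transcribes the
    audio correctly, which indicates the audio itself is understandable.
    """
    audio = tts.synthesize(text)  # text -> audio waveform
    transcriptions = {a.name: normalize(a.transcribe(audio)) for a in asrs}
    expected = normalize(text)
    succeeded = {name for name, t in transcriptions.items() if t == expected}

    if not succeeded:
        # No ASR got it right: the synthesized audio may itself be bad,
        # so the case is indeterminate rather than a failure.
        return {name: "indeterminate" for name in transcriptions}
    return {name: ("success" if name in succeeded else "fail")
            for name in transcriptions}
```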

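The failure probability predictor can likewise be sketched as a text classifier that is retrained on already-executed test cases and then used to rank the remaining texts so the most failure-prone ones are processed first. The bag-of-words logistic regression below is an illustrative stand-in under that assumption, not the estimator evaluated in the paper.

```python
# A minimal sketch of a failure probability predictor that prioritizes
# texts most likely to yield failed test cases. The choice of features
# and model here is an assumption for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

class FailurePredictor:
    def __init__(self):
        self.model = make_pipeline(CountVectorizer(), LogisticRegression())
        self.texts, self.labels = [], []  # executed cases; label 1 = failed

    def record(self, text: str, failed: bool):
        """Store the outcome of an executed test case for retraining."""
        self.texts.append(text)
        self.labels.append(int(failed))

    def rank(self, candidates):
        """Return unexecuted texts ordered most-likely-to-fail first."""
        if len(set(self.labels)) < 2:  # need both classes before training
            return list(candidates)
        self.model.fit(self.texts, self.labels)
        probs = self.model.predict_proba(candidates)[:, 1]
        return [t for _, t in
                sorted(zip(probs, candidates), key=lambda p: -p[0])]
```

Within a time budget, a driver loop would repeatedly take the top-ranked texts from `rank`, run them through the differential oracle above, and feed the outcomes back via `record`.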