Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems

Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR system to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly and are not successful when played back in a room. Previously published over-the-air adversarial examples fall into one of three categories: they are handcrafted examples, they are so conspicuous that human listeners can easily recognize the target transcription once they are alerted to its content, or they require precise information about the room in which the attack takes place and are hence not transferable to other rooms. In this paper, we demonstrate the first algorithm that produces generic adversarial examples against hybrid ASR systems which remain robust in an over-the-air attack that is not adapted to the specific environment. Hence, no prior knowledge of the room characteristics is required. Instead, we use room impulse responses (RIRs) to compute adversarial examples that are robust to arbitrary room characteristics, and we employ the ASR system Kaldi to demonstrate the attack. Furthermore, our algorithm can utilize psychoacoustic methods to hide the changes to the original audio signal below the human thresholds of hearing. In practical experiments, we show that the adversarial examples work for varying room setups and that no direct line-of-sight between the loudspeaker and the microphone is necessary. As a result, an attacker can create inconspicuous adversarial examples for any target transcription and apply them to arbitrary room setups without any prior knowledge.
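At a high level, over-the-air robustness can be obtained by optimizing the expected attack loss over many simulated rooms: in each optimization step the perturbed signal is convolved with a randomly chosen room impulse response before it is scored against the target transcription. The sketch below only illustrates this idea and is not the paper's Kaldi-based implementation; the names `make_robust_example` and `asr_loss`, the pre-simulated RIR bank `rirs`, and the crude amplitude clamp (standing in for the psychoacoustic hearing-threshold constraint) are assumptions introduced for illustration.

```python
# Illustrative sketch (assumptions noted above, not the authors' implementation):
# optimize a perturbation so that it still works after convolution with many
# different room impulse responses (RIRs).
import torch
import torch.nn.functional as F

def make_robust_example(clean, target, rirs, asr_loss,
                        steps=1000, lr=1e-3, eps=0.05):
    """clean: (num_samples,) waveform; rirs: (num_rirs, rir_len) impulse responses;
    asr_loss: hypothetical differentiable loss pushing the ASR towards `target`."""
    delta = torch.zeros_like(clean, requires_grad=True)   # adversarial perturbation
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        # Sample one simulated room per step (expectation over room conditions).
        rir = rirs[torch.randint(len(rirs), (1,)).item()]

        # Simulate playback in that room: convolve the perturbed signal with the RIR.
        x = (clean + delta).view(1, 1, -1)
        k = rir.flip(0).view(1, 1, -1)                     # conv1d cross-correlates, so flip the kernel
        simulated = F.conv1d(x, k, padding=k.shape[-1] - 1).view(-1)[:clean.numel()]

        loss = asr_loss(simulated, target)                 # loss w.r.t. the target transcription
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Keep the perturbation small; a psychoacoustic (hearing-threshold)
        # constraint would replace this crude clamp in the actual attack.
        with torch.no_grad():
            delta.clamp_(-eps, eps)

    return clean + delta.detach()
```

In such a setup, the RIR bank would be generated for many randomly sampled room geometries and loudspeaker/microphone positions (e.g., with an image-source simulator), so that the optimized perturbation does not overfit a single room and remains effective without prior knowledge of the attack environment.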
