HVAC: Evading Classifier-based Defenses in Hidden Voice Attacks

Recent years have witnessed the rapid development of automatic speech recognition (ASR) systems, which provide a practical voice-user interface for widely deployed smart devices. With the ever-growing deployment of this interface, several voice-based attacks have been proposed against current ASR systems to exploit their vulnerabilities. Among the more serious threats, the hidden voice attack exploits the human-machine perception gap to generate obfuscated/hidden voice commands that are unintelligible to human listeners but can be interpreted as commands by machines. However, because normal and obfuscated samples differ significantly in their acoustic features, recent studies show that hidden voice commands can be easily detected and defended against by a pre-trained classifier, making them less threatening. In this paper, we validate that such a defense strategy can be circumvented by a more advanced type of hidden voice attack called HVAC. Our proposed HVAC attack easily bypasses existing learning-based defense classifiers while preserving the essential characteristics of hidden voice attacks (i.e., unintelligible to humans yet recognizable to machines). Specifically, we find that all classifier-based defenses build on classification models trained with acoustic features extracted from the entire audio of normal and obfuscated samples, whereas only the speech parts (i.e., human voice parts) of these samples carry the linguistic information needed for machine transcription. We therefore propose a fusion-based method that combines a normal sample and its corresponding obfuscated sample into a hybrid HVAC command, which effectively deceives the defense classifiers. Moreover, to make the command even less intelligible to humans, we tune the speed and pitch of the sample, further distorting it in the time domain while ensuring that machines can still recognize it.
Extensive physical over-the-air experiments demonstrate the robustness and generalizability of our HVAC attack under different realistic attack scenarios. Results show that our HVAC commands achieve an average success rate of 94.1% in bypassing machine-learning-based defense approaches under various realistic settings.
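The two operations sketched in the abstract — fusing speech parts of an obfuscated sample into a normal carrier, and jointly shifting speed and pitch — can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction, not the paper's actual pipeline: the energy-based speech detection, the frame parameters, and the naive resampling used for speed/pitch tuning are all illustrative assumptions.

```python
import numpy as np

def frame_energy(x, frame_len=400, hop=160):
    """Short-time energy per frame (a crude stand-in for voice activity detection)."""
    n = 1 + max(0, len(x) - frame_len) // hop
    return np.array([np.sum(x[i * hop : i * hop + frame_len] ** 2) for i in range(n)])

def fuse(normal, obfuscated, frame_len=400, hop=160, thresh_ratio=0.1):
    """Hybrid command: obfuscated audio in speech frames, normal audio elsewhere,
    so that whole-audio acoustic features stay close to the normal sample."""
    n = min(len(normal), len(obfuscated))
    normal, obfuscated = normal[:n], obfuscated[:n]
    energies = frame_energy(normal, frame_len, hop)
    thresh = thresh_ratio * energies.max()
    out = normal.copy()
    for i, e in enumerate(energies):
        if e > thresh:  # treat high-energy frames as speech
            out[i * hop : i * hop + frame_len] = obfuscated[i * hop : i * hop + frame_len]
    return out

def speed_pitch(x, factor=1.2):
    """Naive resampling: factor > 1 speeds playback up and raises pitch together,
    distorting the command in the time domain."""
    idx = np.arange(0, len(x), factor)
    return np.interp(idx, np.arange(len(x)), x)
```

In practice the paper's over-the-air attack would operate on real recordings and tune the speed/pitch factors until the command remains machine-recognizable; the sketch only shows the shape of the signal manipulation.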
