Generating Robust Audio Adversarial Examples with Temporal Dependency

Audio adversarial examples, imperceptible to humans, have been constructed to attack automatic speech recognition (ASR) systems. However, the adversarial examples generated by existing approaches usually incorporate noticeable noises, especially during the periods of silences and pauses. Moreover, the added noises often break temporal dependency property of the original audio, which can be easily detected by state-of-the-art defense mechanisms. In this paper, we propose a new Iterative Proportional Clipping (IPC) algorithm that preserves temporal dependency in audios for generating more robust adversarial examples. We are motivated by an observation that the temporal dependency in audios imposes a significant effect on human perception. Following our observation, we leverage a proportional clipping strategy to reduce noise during the low-intensity periods. Experimental results and user study both suggest that the generated adversarial examples can significantly reduce human-perceptible noises and resist the defenses based on the temporal structure.

[1]  Methods for the subjective assessment of small impairments in audio systems , 2015 .

[2]  Moustapha Cissé,et al.  Houdini: Fooling Deep Structured Prediction Models , 2017, ArXiv.

[3]  J. J. Marine Biology , 1914, Nature.

[4]  Jonathon Shlens,et al.  Explaining and Harnessing Adversarial Examples , 2014, ICLR.

[5]  Micah Sherr,et al.  Cocaine Noodles: Exploiting the Gap between Human and Machine Speech Recognition , 2015, WOOT.

[6]  Hanqing Guo,et al.  SurfingAttack: Interactive Hidden Attack on Voice Assistants Using Ultrasonic Guided Waves , 2020, NDSS.

[7]  Christian Poellabauer,et al.  Crafting Adversarial Examples For Speech Paralinguistics Applications , 2017, ArXiv.

[8]  Wenyuan Xu,et al.  DolphinAttack: Inaudible Voice Commands , 2017, CCS.

[9]  Fabio Roli,et al.  Evasion Attacks against Machine Learning at Test Time , 2013, ECML/PKDD.

[10]  Yue Zhao,et al.  CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition , 2018, USENIX Security Symposium.

[11]  Gabriel Synnaeve,et al.  Letter-Based Speech Recognition with Gated ConvNets , 2017, ArXiv.

[12]  Gabriel Synnaeve,et al.  Wav2Letter: an End-to-End ConvNet-based Speech Recognition System , 2016, ArXiv.

[13]  Dorothea Kolossa,et al.  Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding , 2018, NDSS.

[14]  Micah Sherr,et al.  Hidden Voice Commands , 2016, USENIX Security Symposium.

[15]  Hiromu Yakura,et al.  Robust Audio Adversarial Example for a Physical Attack , 2018, IJCAI.

[16]  David A. Wagner,et al.  Audio Adversarial Examples: Targeted Attacks on Speech-to-Text , 2018, 2018 IEEE Security and Privacy Workshops (SPW).

[17]  Patrick Traynor,et al.  Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems , 2019, NDSS.

[18]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Birger Kollmeier,et al.  PEMO-Q—A New Method for Objective Audio Quality Assessment Using a Model of Auditory Perception , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Zhuolin Yang,et al.  Characterizing Audio Adversarial Examples Using Temporal Dependency , 2018, ICLR.