On the human evaluation of audio adversarial examples

Human-machine interaction is increasingly dependent on speech communication, and machine learning models are commonly used to interpret human speech commands. However, these models can be fooled by adversarial examples: inputs intentionally perturbed so that they produce a wrong prediction while the perturbation goes unnoticed. While much research has focused on developing new techniques to generate adversarial perturbations, less attention has been given to the factors that determine whether and how such perturbations are noticed by humans. This question matters because the high fooling rates reported for adversarial perturbation strategies are only valuable if the perturbations cannot be detected by human listeners. In this paper we investigate to what extent the distortion metrics proposed in the literature for audio adversarial examples, which are commonly used to evaluate the effectiveness of attack-generation methods, reliably measure the human perception of the perturbations. Using an analytical framework and an experiment in which 18 subjects evaluate audio adversarial examples, we demonstrate that the metrics employed by convention are not a reliable measure of the perceptual similarity of adversarial examples in the audio domain.
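The distortion metrics in question are typically simple waveform-level quantities. As a minimal illustrative sketch (not the paper's evaluation code; the function names are our own), the snippet below computes two metrics that recur in this literature: the relative perturbation loudness dB_x(δ) = dB(δ) − dB(x), in the formulation popularized by Carlini and Wagner's targeted speech-to-text attacks, and the signal-to-noise ratio between the clean waveform and the perturbation. The paper's central claim is precisely that such quantities need not track what listeners actually hear.

```python
import numpy as np

def db(x):
    """Peak amplitude of a waveform in decibels: 20 * log10(max |x_i|)."""
    return 20.0 * np.log10(np.max(np.abs(x)) + 1e-12)  # epsilon avoids log(0)

def db_distortion(original, adversarial):
    """Relative loudness of the perturbation, dB_x(delta) = dB(delta) - dB(x).
    More negative values mean a quieter perturbation relative to the signal."""
    delta = adversarial - original
    return db(delta) - db(original)

def snr_db(original, adversarial):
    """Signal-to-noise ratio in dB between the clean signal and the perturbation."""
    delta = adversarial - original
    return 10.0 * np.log10(np.sum(original**2) / (np.sum(delta**2) + 1e-12))

# Example: a faint white-noise perturbation on a synthetic 440 Hz tone.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s at 16 kHz
adv = x + 0.001 * rng.standard_normal(x.shape)
print(db_distortion(x, adv), snr_db(x, adv))
```

Both metrics operate on raw sample values and ignore psychoacoustic effects such as frequency-dependent sensitivity and masking, which is one reason a perturbation with a "good" score can still be audible, and vice versa.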
