Fast speech adversarial example generation for keyword spotting system with conditional GAN

Abstract Deep network-based keyword spotting (KWS) has achieved great success in many speech assistant applications. However, such network-based KWS systems have been shown to be vulnerable to adversarial attacks. In this work, we propose using a conditional generative adversarial network (CGAN) to efficiently craft targeted speech adversarial examples. First, we transform the attack target label into a vector, which serves as the condition input of the CGAN. The generator is tasked with producing a perturbation that causes the adversarial example to be misclassified as the pre-specified target keyword while simultaneously deceiving the discriminator into classifying the adversarial example as genuine. The discriminator aims to distinguish the crafted adversarial examples from legitimate samples. Second, the target network-based KWS classifier(s) are ensembled and integrated into the proposed CGAN framework to force the generator to construct model-independent perturbations. The classification error loss of the target KWS is back-propagated to guide the weight updates of the generator. Finally, with a properly designed network architecture and training procedure, we obtain a well-trained generator that produces the adversarial perturbation for a given speech clip and target label. Experimental results show that the crafted adversarial examples can effectively attack the state-of-the-art KWS system with a high attack success rate while retaining acceptable perceptual quality.
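The generator objective outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-hot encoding of the target label, the least-squares form of the GAN term, the averaging over ensemble members, the weight `alpha`, and all function names are assumptions made for illustration only. The sketch works on plain numbers (a discriminator score and per-model logits) rather than on real networks.

```python
import math


def one_hot(label_index, num_classes):
    """Encode the target keyword index as a condition vector (assumed one-hot)."""
    if not 0 <= label_index < num_classes:
        raise ValueError("label_index out of range")
    return [1.0 if i == label_index else 0.0 for i in range(num_classes)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, condition):
    """Cross-entropy between one classifier's output and the target condition."""
    probs = softmax(logits)
    return -sum(c * math.log(p) for c, p in zip(condition, probs) if c > 0)


def ensemble_attack_loss(logits_per_model, condition):
    """Average targeted loss over the ensembled KWS classifiers (assumed mean)."""
    losses = [cross_entropy(logits, condition) for logits in logits_per_model]
    return sum(losses) / len(losses)


def generator_loss(d_score, logits_per_model, condition, alpha=1.0):
    """Combine the two generator goals named in the abstract.

    d_score: discriminator output for the adversarial example (1.0 = genuine).
    alpha:   hypothetical weight on the attack term.
    """
    # Assumed least-squares GAN term: push D(x_adv) toward the "genuine" score.
    gan_term = (d_score - 1.0) ** 2
    # Attack term: push every ensemble member toward the target keyword.
    attack_term = ensemble_attack_loss(logits_per_model, condition)
    return gan_term + alpha * attack_term


if __name__ == "__main__":
    condition = one_hot(2, 4)                      # target keyword index 2 of 4
    ensemble_logits = [[0.1, 0.2, 3.0, 0.0],       # classifier A on x_adv
                       [0.0, 1.5, 1.0, 0.3]]       # classifier B on x_adv
    print(generator_loss(0.8, ensemble_logits, condition))
```

In an actual training loop this scalar would be computed on tensors so that its gradient can flow back through the KWS classifiers and the discriminator into the generator weights; the sketch only shows how the two terms are combined.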
