Adversarial Attacks and Defenses for Speaker Identification Systems

Research in automatic speaker recognition (SR) has been undertaken for several decades and has reached strong performance. However, researchers discovered potential loopholes in these technologies, such as spoofing attacks (voice replay, conversion, and synthesis), and have investigated them thoroughly in recent years. More recently, a new class of attack, termed adversarial attacks, has proved devastating in computer vision (CV), so it is vital to study its effect on SR systems. This paper examines how vulnerable state-of-the-art speaker identification (SID) systems are to adversarial attacks and how to defend against them. We investigated the adversarial attacks most common in the literature: the fast gradient sign method (FGSM), iterative FGSM (also known as the basic iterative method, BIM), and Carlini-Wagner (CW). Furthermore, we propose four pre-processing defenses against these attacks: randomized smoothing, DefenseGAN, a variational autoencoder (VAE), and a WaveGAN vocoder. We found that SID systems were extremely vulnerable to the BIM and CW attacks. The randomized smoothing defense robustified the system against imperceptible BIM and CW attacks, recovering classification accuracies of about 97%. The defenses based on generative models (DefenseGAN, VAE, and WaveGAN) project adversarial examples, which lie outside the clean-data manifold, back onto that manifold. When the attacker cannot adapt the attack to the defense (black-box defense), WaveGAN performed best, staying close to the clean condition (accuracy > 97%). However, when the attack is adapted to the defense, assuming the attacker has access to the defense model (white-box defense), the protection offered by the VAE and WaveGAN dropped significantly, to 50% and 37% accuracy, respectively, under the CW attack. To counteract this, we combined randomized smoothing with the VAE or WaveGAN. We found that smoothing followed by the WaveGAN vocoder was the most effective defense overall. As a black-box defense, it provides 93% average accuracy. As a white-box defense, accuracy degraded only for iterative attacks with perceptible perturbations (L∞ ≥ 0.01).
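For concreteness, the following is a minimal PyTorch sketch of the FGSM and BIM attacks named above, assuming a hypothetical differentiable SID classifier `model` that maps a batch of waveforms to speaker logits; it illustrates the standard attack formulations, not the authors' exact implementation (the CW attack, which solves an optimization problem per example, is omitted for brevity).

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps):
    # Single-step FGSM: move the waveform x along the sign of the loss
    # gradient, with L-infinity budget eps.
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def bim_attack(model, x, y, eps, alpha, n_iter):
    # Iterative FGSM (BIM): repeated FGSM steps of size alpha, each
    # re-projected into the L-infinity eps-ball around the original x.
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        F.cross_entropy(model(x_adv), y).backward()
        with torch.no_grad():
            x_adv = x + torch.clamp(
                x_adv + alpha * x_adv.grad.sign() - x, -eps, eps)
    return x_adv
```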
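Randomized smoothing, the first of the pre-processing defenses, can be sketched as a majority vote over Gaussian-noisy copies of the utterance. The function below assumes a single-utterance waveform tensor `x` of shape (1, num_samples); the vote count and noise level are illustrative, not the exact configuration used in the experiments.

```python
import torch
from collections import Counter

def smoothed_predict(model, x, sigma, n_votes=32):
    # Classify several Gaussian-perturbed copies of the waveform and
    # return the majority-vote speaker label; the added noise washes
    # out small adversarial perturbations.
    votes = Counter()
    with torch.no_grad():
        for _ in range(n_votes):
            noisy = x + sigma * torch.randn_like(x)
            votes[int(model(noisy).argmax(dim=-1))] += 1
    return votes.most_common(1)[0][0]
```

In the combined defense described above, each noisy copy would additionally be passed through the VAE or the WaveGAN vocoder before classification.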
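The generative defenses share a common purification pattern: reconstruct the input with a model trained only on clean speech, then classify the reconstruction. A hedged sketch for the VAE case follows, with hypothetical `encoder`/`decoder` modules standing in for the actual networks (the DefenseGAN and WaveGAN variants differ in how the clean reconstruction is obtained).

```python
import torch

def vae_purify(encoder, decoder, x):
    # Project a (possibly adversarial) input back toward the clean-speech
    # manifold by passing it through a VAE trained on clean data only.
    # encoder returns the approximate-posterior parameters; decoding the
    # posterior mean gives a deterministic purified reconstruction.
    with torch.no_grad():
        mu, logvar = encoder(x)
        return decoder(mu)

# Classify the purified signal instead of the raw input:
# speaker = model(vae_purify(encoder, decoder, x_adv)).argmax(dim=-1)
```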
