x-Vectors Meet Adversarial Attacks: Benchmarking Adversarial Robustness in Speaker Verification

Automatic Speaker Verification (ASV) enables high-security applications like user authentication or criminal investigation. However, ASV can be subjected to malicious attacks, which could compromise that security. The ASV literature mainly studies spoofing (a.k.a. impersonation) attacks such as voice replay, synthesis, or conversion. Meanwhile, another class of attacks, known as adversarial attacks, has become a threat to all kinds of machine learning systems. Adversarial attacks introduce an imperceptible perturbation in the input signal that radically changes the behavior of the system. These attacks have been studied intensively in the image domain but less so in the speech domain. In this work, we investigate the vulnerability of state-of-the-art ASV systems to adversarial attacks. We consider a threat model in which a perturbation noise is added to the test waveform to alter the ASV decision. We also discuss the methodology and metrics needed to benchmark adversarial attacks and defenses in ASV. We evaluate three x-vector architectures, which performed among the best in recent ASV evaluations, against fast gradient sign method (FGSM) and Carlini-Wagner attacks. All networks were highly vulnerable in the white-box attack scenario, even at high SNR (30-60 dB). Furthermore, we successfully transferred attacks generated with smaller white-box networks to attack a larger black-box network.
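To make the threat model concrete, below is a minimal sketch of a white-box FGSM attack on the test waveform of an ASV trial, together with the SNR metric used to quantify how perceptible the perturbation is. This is an illustrative reconstruction, not the authors' implementation: `asv_model` is assumed to be any differentiable module mapping a waveform to a speaker embedding (e.g., an x-vector network), and cosine similarity stands in for the verification score.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(asv_model, enroll_embedding, test_waveform, epsilon, target_accept=False):
    """One-step FGSM perturbation of the test waveform (illustrative sketch).

    asv_model        : differentiable map from waveform to speaker embedding.
    enroll_embedding : fixed embedding of the enrollment utterance.
    test_waveform    : 1-D float tensor in [-1, 1].
    epsilon          : L-infinity perturbation budget (controls the SNR).
    target_accept    : True  -> push a non-target trial toward false acceptance,
                       False -> push a target trial toward false rejection.
    """
    x = test_waveform.clone().detach().requires_grad_(True)
    test_embedding = asv_model(x)
    # Cosine similarity between embeddings acts as the verification score.
    score = F.cosine_similarity(enroll_embedding.unsqueeze(0),
                                test_embedding.unsqueeze(0))
    score.backward()
    # Step along the signed gradient: +1 raises the score, -1 lowers it.
    direction = 1.0 if target_accept else -1.0
    x_adv = x + direction * epsilon * x.grad.sign()
    return x_adv.clamp(-1.0, 1.0).detach()

def snr_db(clean, adv):
    """Signal-to-noise ratio of the adversarial perturbation, in dB."""
    noise = adv - clean
    return 10.0 * torch.log10(clean.pow(2).sum() / noise.pow(2).sum())
```

Smaller values of `epsilon` yield higher SNR (quieter, less perceptible perturbations); the abstract's finding is that attacks remain effective even in the 30-60 dB range. The Carlini-Wagner attack replaces this single signed-gradient step with an iterative optimization that explicitly minimizes the perturbation norm subject to flipping the decision.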
