SEC4SR: A Security Analysis Platform for Speaker Recognition

Adversarial attacks have been extended to speaker recognition (SR). However, existing attacks are often assessed using different SR models, recognition tasks, and datasets, and only a few adversarial defenses, borrowed from computer vision, have been considered. Moreover, these defenses have not been thoroughly evaluated against adaptive attacks. Thus, there is still a lack of quantitative understanding of the strengths and limitations of adversarial attacks and defenses, and more effective defenses are required to secure SR systems. To bridge this gap, we present SEC4SR, the first platform enabling researchers to systematically and comprehensively evaluate adversarial attacks and defenses in SR. SEC4SR incorporates 4 white-box and 2 black-box attacks and 24 defenses, including our novel feature-level transformations. It also contains techniques for mounting adaptive attacks. Using SEC4SR, we conduct the largest-scale empirical study to date on adversarial attacks and defenses in SR, involving 23 defenses, 15 attacks, and 4 attack settings. Our study provides many useful findings that may advance future research, such as: (1) all the transformations slightly degrade accuracy on benign examples, and their effectiveness varies with attacks; (2) most transformations become less effective under adaptive attacks, but some become more effective; (3) a few transformations combined with adversarial training yield stronger defenses against some but not all attacks, while our feature-level transformation combined with adversarial training yields the strongest defense against all the attacks. Extensive experiments demonstrate the capabilities and advantages of SEC4SR, which can benefit future research in SR.
