A Tandem Framework Balancing Privacy and Security for Voice User Interfaces

Speech synthesis, voice cloning, and voice conversion techniques present severe privacy and security threats to users of voice user interfaces (VUIs). These techniques transform one or more elements of a speech signal, e.g., identity, emotion, or accent, while preserving linguistic information. Adversaries may use such transformation tools to mount spoofing attacks that present fraudulent biometrics of a legitimate speaker. Conversely, the same techniques have been used to generate privacy-transformed speech by suppressing personally identifiable attributes in the voice signal, achieving anonymization. Prior work has studied the security and privacy vectors in isolation, which raises an alarming possibility: if a benign user can achieve privacy through a transformation, a malicious user can apply the same transformation to break security by bypassing the anti-spoofing mechanism. In this paper, we take a step toward balancing these two seemingly conflicting requirements: security and privacy. It remains unclear what vulnerabilities in one domain imply for the other, and what dynamic interactions exist between them. A better understanding of these aspects is crucial for assessing and mitigating the vulnerabilities inherent in VUIs and for building effective defenses.
In this paper, (i) we investigate the applicability of current voice anonymization methods by deploying a tandem framework that jointly combines anti-spoofing and authentication models, and we evaluate the performance of these methods; (ii) through analytical and empirical evidence, we reveal a duality between the two mechanisms, which offer different ways to achieve the same objective, and we show that leveraging one vector significantly amplifies the effectiveness of the other; (iii) we demonstrate that effectively defending VUIs against potential attacks requires investigating the attacks from multiple complementary perspectives (i.e., security and privacy) and carefully accounting for the effects of deployed countermeasures, and we point to several promising research directions.
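The tandem evaluation described in (i) can be illustrated with a minimal sketch: a trial is accepted only if it passes both the spoofing countermeasure (CM) and the automatic speaker verification (ASV) subsystem. The function and threshold names below are illustrative assumptions, not the authors' implementation; real systems calibrate these thresholds from score distributions (e.g., via the tandem detection cost function).

```python
def tandem_decision(cm_score: float, asv_score: float,
                    cm_threshold: float = 0.5,
                    asv_threshold: float = 0.5) -> bool:
    """Gated tandem decision (illustrative thresholds, not calibrated).

    A trial is accepted only if BOTH subsystems accept it:
      - the countermeasure (CM) judges the audio bona fide, and
      - the ASV model matches the claimed speaker identity.
    """
    is_bonafide = cm_score >= cm_threshold   # CM: reject synthetic/converted speech
    is_target = asv_score >= asv_threshold   # ASV: verify the claimed identity
    return is_bonafide and is_target

# The duality in (ii), seen through this gate: an anonymization transform
# strong enough to defeat ASV (low asv_score) resembles the spoofed speech
# the CM was trained to reject, and a transform that fools the CM
# (high cm_score on converted speech) is exactly a spoofing attack.
```

This gating structure is why evaluating anonymization against ASV alone is insufficient: a privacy transform may be rejected by the CM stage even when it successfully suppresses speaker identity.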
