VSMask: Defending Against Voice Synthesis Attack via Real-Time Predictive Perturbation

Deep learning-based voice synthesis technology generates artificial, human-like speech, which has been exploited in deepfake and identity theft attacks. Existing defense mechanisms inject subtle adversarial perturbations into raw speech audio to mislead voice synthesis models. However, optimizing the adversarial perturbation not only consumes substantial computation time but also requires the entire speech to be available in advance. These defenses are therefore unsuitable for protecting live speech streams, such as voice messages or online meetings. In this paper, we propose VSMask, a real-time protection mechanism against voice synthesis attacks. Unlike offline protection schemes, VSMask leverages a predictive neural network to forecast the most effective perturbation for the upcoming streaming speech. VSMask introduces a universal perturbation tailored to arbitrary speech input, so that real-time speech can be shielded in its entirety. To minimize audio distortion within the protected speech, we apply a weight-based perturbation constraint that reduces the perceptibility of the added perturbation. We comprehensively evaluate VSMask's protection performance under different scenarios. The experimental results indicate that VSMask effectively defends against 3 popular voice synthesis models: none of the synthesized voices could deceive the speaker verification models or human ears when VSMask protection was applied. In a physical-world experiment, we demonstrate that VSMask successfully safeguards real-time speech by injecting the perturbation over the air.
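The abstract outlines two core ideas: a predictive model that forecasts a perturbation for the speech chunk that has not yet arrived, and a weight-based constraint that bounds the perturbation amplitude to keep it hard to perceive. The snippet below is a minimal illustrative sketch of that streaming loop in PyTorch; the predictor architecture (`PerturbationPredictor`), the chunk size, and the particular form of the per-sample budget in `weighted_constraint` are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of real-time predictive perturbation, assuming a PyTorch setup.
# All names, shapes, and the weight-based budget rule are illustrative assumptions.
import torch
import torch.nn as nn

CHUNK = 1024  # samples per streaming chunk (assumed)

class PerturbationPredictor(nn.Module):
    """Forecasts a perturbation for the *next* audio chunk from the current one."""
    def __init__(self, chunk=CHUNK, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(input_size=chunk, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, chunk)

    def forward(self, chunk_in, state=None):
        # chunk_in: (batch, 1, CHUNK) -> perturbation forecast for the upcoming chunk
        out, state = self.rnn(chunk_in, state)
        return torch.tanh(self.head(out)), state

def weighted_constraint(delta, speech_chunk, eps=0.02):
    # Weight-based amplitude budget (assumed form): allow a larger perturbation
    # where the speech itself is louder, so the added noise is less perceptible.
    weight = speech_chunk.abs() / (speech_chunk.abs().max() + 1e-8)
    budget = eps * (0.5 + 0.5 * weight)  # per-sample bound
    return torch.clamp(delta, -budget, budget)

# Usage sketch: while chunk t is being captured, forecast and bound the
# perturbation that will be mixed into chunk t+1.
predictor = PerturbationPredictor().eval()
state = None
stream = [torch.randn(CHUNK) * 0.1 for _ in range(4)]  # placeholder audio chunks
for chunk in stream:
    with torch.no_grad():
        delta, state = predictor(chunk.view(1, 1, -1), state)
    protected_delta = weighted_constraint(delta.view(-1), chunk)
    # protected_delta would be added to the next incoming chunk of the live stream
```

The forecasting step is what removes the need to see the whole utterance before optimizing, at the cost of predicting one chunk ahead; the amplitude budget plays the role the abstract attributes to the weight-based perturbation constraint.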
