Time Domain Adversarial Voice Conversion for ADD 2022

In this paper, we describe our speech generation system for the first Audio Deep Synthesis Detection Challenge (ADD 2022). We first build an any-to-many voice conversion (VC) system that converts source speech with arbitrary linguistic content into fake speech in a target speaker's voice. The converted speech is then post-processed in the time domain to improve its ability to deceive detectors. Experimental results show that our system is adversarially effective against anti-spoofing detectors with only a small compromise in audio quality and speaker similarity. The system ranked first in Track 3.1 of ADD 2022, indicating that our method also generalizes well across different detectors.
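The time-domain post-processing step can be pictured as a gradient-sign perturbation applied directly to the waveform to lower a detector's "fake" score. The sketch below is an illustrative assumption, not the paper's actual method: it uses a toy differentiable detector (logistic regression on raw samples) so the gradient has a closed form, where `fgsm_waveform_attack`, `det_weights`, and `eps` are hypothetical names introduced here for illustration.

```python
import numpy as np

def fgsm_waveform_attack(wave, det_weights, det_bias, eps=0.002):
    """One FGSM-style step in the time domain (hedged sketch).

    Toy detector: p_fake = sigmoid(w . x + b). We nudge each sample
    against the sign of the gradient of p_fake, so the detector's
    fake score drops while the waveform barely changes.
    """
    logit = det_weights @ wave + det_bias
    p_fake = 1.0 / (1.0 + np.exp(-logit))
    # d p_fake / d wave = p * (1 - p) * w  for a logistic detector
    grad = p_fake * (1.0 - p_fake) * det_weights
    adv = wave - eps * np.sign(grad)      # descend the fake score
    return np.clip(adv, -1.0, 1.0)        # keep samples in valid range
```

In a real system the detector would be a neural network and the gradient would come from backpropagation, with `eps` tuned small enough that listeners perceive no quality loss.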
