Audio Deepfake Detection System with Neural Stitching for ADD 2022

This paper describes our best system and methodology for ADD 2022: the First Audio Deep Synthesis Detection Challenge [1]. The same system, trained with a similar methodology, was used in both rounds of evaluation in Track 3.2. The first round of Track 3.2 data was generated by text-to-speech (TTS) or voice conversion (VC) algorithms, while the second round consisted of fake audio generated by other participants in Track 3.1, aiming to spoof our systems. Our system uses a standard 34-layer ResNet [2] with multi-head attention pooling [3] to learn discriminative embeddings for fake audio and spoof detection. We further employ neural stitching to boost the model's generalization capability so that it performs equally well across different tasks; more details are explained in the following sections. Experiments show that our proposed method outperforms all other systems, with a 10.1% equal error rate (EER) in Track 3.2.
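The abstract names two architectural components: multi-head attention pooling over the ResNet's frame-level features, and a neural stitching step for generalization. Below is a minimal PyTorch sketch of multi-head attention pooling in the spirit of [8] and [12]; the head count, tensor shapes, and module names are illustrative assumptions, not the authors' actual code.

```python
import torch
import torch.nn as nn


class MultiHeadAttentionPooling(nn.Module):
    """Pools frame-level features into one utterance-level embedding.

    Each head learns its own attention weights over time on its slice of
    the feature dimension; head outputs are concatenated.
    """

    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        assert feat_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        # One learned scoring vector per head.
        self.score = nn.Parameter(torch.randn(num_heads, self.head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) frame-level features from the encoder.
        b, t, _ = x.shape
        x = x.view(b, t, self.num_heads, self.head_dim)
        # Per-head attention logits over time: (batch, time, heads).
        logits = torch.einsum("bthd,hd->bth", x, self.score)
        weights = torch.softmax(logits, dim=1).unsqueeze(-1)
        # Attention-weighted sum over time, then concatenate heads.
        pooled = (weights * x).sum(dim=1)  # (batch, heads, head_dim)
        return pooled.flatten(1)           # (batch, feat_dim)


# Example: 8 utterances, 200 frames, 256-dim features -> (8, 256) embeddings.
emb = MultiHeadAttentionPooling(256, num_heads=4)(torch.randn(8, 200, 256))
```

The paper defers the details of neural stitching to later sections. As a rough, hedged illustration of the stitching-layer idea from [20] (Lenc and Vedaldi), a learned 1x1 convolution maps the activations of one trained network's lower layers into the input space of another network's upper layers; all shapes and names here are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class StitchingLayer(nn.Module):
    """1x1 conv adapting net A's feature maps to net B's expected channels."""

    def __init__(self, channels_a: int, channels_b: int):
        super().__init__()
        self.adapt = nn.Conv2d(channels_a, channels_b, kernel_size=1)

    def forward(self, feats_a: torch.Tensor) -> torch.Tensor:
        # feats_a: (batch, channels_a, freq, time) from net A's lower layers.
        return self.adapt(feats_a)


# Hypothetical usage: lower half of model A, stitch, upper half of model B.
# out = model_b_upper(StitchingLayer(256, 256)(model_a_lower(x)))
```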

[1] Haizhou Li, et al., "ADD 2022: The First Audio Deep Synthesis Detection Challenge," ICASSP 2022.

[2] Vandana P. Janeja, et al., "How Deep Are the Fakes? Focusing on Audio Deepfake: A Survey," arXiv, 2021.

[3] Zhen-Hua Ling, et al., "Adversarial Voice Conversion Against Neural Spoofing Detectors," Interspeech, 2021.

[4] Bin Ma, et al., "Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram," ICASSP 2021.

[5] Tao Qin, et al., "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech," ICLR 2021.

[6] Ganesh Sivaraman, et al., "Generalization of Audio Deepfake Detection," Odyssey, 2020.

[7] Jaehyeon Kim, et al., "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," NeurIPS 2020.

[8] Kaisheng Yao, et al., "Multi-Resolution Multi-Head Attention in Deep Speaker Embedding," ICASSP 2020.

[9] Yi Yang, et al., "Random Erasing Data Augmentation," AAAI, 2017.

[10] Ross B. Girshick, et al., "Focal Loss for Dense Object Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[11] Diganta Misra, et al., "Mish: A Self Regularized Non-Monotonic Neural Activation Function," arXiv, 2019.

[12] Pooyan Safari, et al., "Self Multi-Head Attention for Speaker Recognition," Interspeech, 2019.

[13] Quoc V. Le, et al., "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," Interspeech, 2019.

[14] Tomi Kinnunen, et al., "ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection," Interspeech, 2019.

[15] Sanjeev Khudanpur, et al., "Spoken Language Recognition using X-vectors," Odyssey, 2018.

[16] Ming Li, et al., "A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification," ICASSP 2018.

[17] Koichi Shinoda, et al., "Attentive Statistics Pooling for Deep Speaker Embedding," Interspeech, 2018.

[18] Sergey Ioffe, et al., "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," AAAI, 2016.

[19] Daniel Povey, et al., "MUSAN: A Music, Speech, and Noise Corpus," arXiv, 2015.

[20] Andrea Vedaldi, et al., "Understanding Image Representations by Measuring Their Equivariance and Equivalence," International Journal of Computer Vision, 2014.

[21] Tomi Kinnunen, et al., "A comparison of features for synthetic speech detection," Interspeech, 2015.

[22] Rob Fergus, et al., "Visualizing and Understanding Convolutional Networks," ECCV, 2013.

[23] Barbara G. Shinn-Cunningham, et al., "Localizing nearby sound sources in a classroom: binaural room impulse responses," The Journal of the Acoustical Society of America, 2005.