Learning From Yourself: A Self-Distillation Method For Fake Speech Detection

In this paper, we propose a novel self-distillation method for fake speech detection (FSD) that significantly improves FSD performance without increasing model complexity. Fine-grained information, such as spectrogram defects and silent segments, is very important for FSD and is often perceived by shallow networks. However, shallow networks are noisy and cannot capture this information well. To address this problem, we propose using the deepest network to instruct the shallow networks and thereby enhance them. Specifically, the FSD network is divided into several segments: the deepest network serves as the teacher model, and each shallow network becomes a student model by adding a classifier. In addition, a distillation path between the deepest network's features and the shallow networks' features is used to reduce the feature difference. Experimental results on the ASVspoof 2019 LA and PA datasets show the effectiveness of the proposed method, with significant improvements over the baseline.
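To make the training signal described above more concrete, the following is a minimal sketch of one possible per-utterance self-distillation loss, written in plain Python. It is not the authors' implementation. The abstract states only that shallow segments receive classifiers and that a feature-level distillation path links them to the deepest segment. The soft-label KL term, the weights `alpha` and `beta`, the `temperature`, and all function names are assumptions made for illustration. The sketch also assumes that shallow features have already been projected to the same dimension as the deepest feature.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over a list of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Hard-label loss (bona fide vs. spoof) for one classifier.
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    # Mean squared difference between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def self_distillation_loss(student_logits, student_feats,
                           teacher_logits, teacher_feat, label,
                           alpha=0.5, beta=0.1, temperature=2.0):
    # alpha, beta and temperature are hypothetical hyperparameters.
    teacher_probs = softmax(teacher_logits, temperature)
    # The deepest classifier (teacher) is trained on the hard label only.
    loss = cross_entropy(teacher_logits, label)
    for logits, feat in zip(student_logits, student_feats):
        hard = cross_entropy(logits, label)
        # Assumed soft-label term: student mimics the teacher's prediction.
        soft = kl_divergence(teacher_probs, softmax(logits, temperature))
        # Feature distillation path: pull the shallow feature
        # toward the deepest feature.
        hint = mse(feat, teacher_feat)
        loss += (1 - alpha) * hard + alpha * soft + beta * hint
    return loss
```

In this sketch the shallow classifiers are used only to compute the training loss. At inference time only the deepest classifier would be evaluated, which is consistent with the claim that model complexity does not increase.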
