Replay and Synthetic Speech Detection with Res2Net Architecture

Existing approaches for replay and synthetic speech detection still lack generalizability to unseen spoofing attacks. This work proposes to leverage a novel model structure, Res2Net, to improve the generalizability of anti-spoofing countermeasures. Res2Net modifies the ResNet block to enable multiple feature scales. Specifically, it splits the feature maps within one block into multiple channel groups and adds residual-like connections across the channel groups. These connections increase the range of possible receptive fields, resulting in multiple feature scales. This multi-scale mechanism significantly improves the countermeasure’s generalizability to unseen spoofing attacks. It also decreases the model size compared to ResNet-based models. Experimental results show that the Res2Net model consistently outperforms ResNet34 and ResNet50 by a large margin in both the physical access (PA) and logical access (LA) scenarios of the ASVspoof 2019 corpus. Moreover, integration with the squeeze-and-excitation (SE) block can further enhance performance. For feature engineering, we investigate the generalizability of Res2Net combined with different acoustic features, and observe that the constant-Q transform (CQT) achieves the most promising performance in both PA and LA scenarios. Our best single system outperforms other state-of-the-art single systems in both PA and LA of the ASVspoof 2019 corpus.
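The channel-group mechanism described above can be illustrated with a small toy sketch. It is not the authors' implementation: it assumes the hierarchical form used in the original Res2Net formulation (the first group is passed through unchanged, and each later group adds the previous group's output to its own input before filtering), and it replaces the 3x3 convolution with a hypothetical 1-D, 3-tap sum (`conv3`) so that it runs in plain Python. The function names `res2net_split` and `receptive_fields` are illustrative only.

```python
def conv3(signal):
    """Stand-in for a 3x3 convolution: a 1-D, 3-tap sum with zero padding."""
    padded = [0.0] + list(signal) + [0.0]
    return [padded[i] + padded[i + 1] + padded[i + 2] for i in range(len(signal))]


def res2net_split(groups):
    """Apply residual-like connections across channel groups.

    groups: list of equal-length 1-D signals, one per channel group.
    Returns one output signal per group.
    """
    outputs = []
    for i, x in enumerate(groups):
        if i == 0:
            # Assumption: the first group is passed through unfiltered.
            y = list(x)
        elif i == 1:
            y = conv3(x)
        else:
            # Add the previous group's output before filtering, so each
            # successive group sees a wider region of the input.
            prev = outputs[-1]
            y = conv3([a + b for a, b in zip(x, prev)])
        outputs.append(y)
    return outputs


def receptive_fields(scale, length=21):
    """Measure each group's receptive field from its impulse response."""
    impulse = [0.0] * length
    impulse[length // 2] = 1.0
    outputs = res2net_split([impulse] * scale)
    # All filter weights are positive, so the number of nonzero outputs
    # equals the width of the input region that reaches each group.
    return [sum(1 for v in y if v != 0.0) for y in outputs]


if __name__ == "__main__":
    print(receptive_fields(4))
```

With four groups, each group's output covers a different width of the input, which is the sense in which a single block yields several feature scales. In a full block the group outputs would typically be concatenated and mixed by a 1x1 convolution; that step is omitted here.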
