The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion

This paper describes our DKU replay detection system for the ASVspoof 2019 challenge. The goal is to develop spoofing countermeasures for automatic speaker recognition in the physical access scenario. We improve the countermeasure pipeline in four aspects: data augmentation, feature representation, classification, and fusion. First, we introduce an utterance-level deep learning framework for anti-spoofing. It takes a variable-length feature sequence as input and directly outputs an utterance-level score. Based on this framework, we evaluate several input feature representations extracted from either the magnitude spectrum or the phase spectrum. We also apply data augmentation by speed-perturbing the raw waveform. Our best single system is a residual neural network trained on the speed-perturbed group delay gram. It achieves an EER of 1.04% on the development set and 1.08% on the evaluation set. Finally, simply averaging the scores of several single systems further improves performance: our primary system obtains an EER of 0.24% on the development set and 0.66% on the evaluation set.
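The fusion and evaluation steps mentioned in the abstract can be illustrated with a minimal sketch. It assumes each single system emits one score per utterance, that higher scores mean "more likely bona fide", and that labels use 1 for bona fide and 0 for replay. The function names `average_fusion` and `compute_eer` are hypothetical, the toy numbers are invented for illustration, and the threshold sweep is a simple approximation rather than the official challenge scoring tool.

```python
import numpy as np


def average_fusion(system_scores):
    """Fuse systems by averaging their per-utterance scores.

    system_scores: array-like of shape (n_systems, n_utterances).
    Assumption: all systems score the same utterances in the same order.
    """
    return np.mean(np.asarray(system_scores, dtype=float), axis=0)


def compute_eer(scores, labels):
    """Approximate the equal error rate (EER) with a threshold sweep.

    Assumed convention: label 1 = bona fide, label 0 = replay,
    and an utterance is accepted when score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bona_fide = scores[labels == 1]
    replay = scores[labels == 0]

    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        far = np.mean(replay >= threshold)    # replay wrongly accepted
        frr = np.mean(bona_fide < threshold)  # bona fide wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the threshold where FAR and FRR are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    # Toy scores from two hypothetical systems on four utterances.
    labels = [1, 1, 0, 0]
    system_a = [0.9, 0.4, 0.5, 0.1]
    system_b = [0.8, 0.7, 0.2, 0.3]
    fused = average_fusion([system_a, system_b])
    print("EER system A:", compute_eer(system_a, labels))
    print("EER fused:   ", compute_eer(fused, labels))
```

Plain averaging implicitly assumes the systems' scores lie on comparable scales; if they do not, some form of score normalization before fusion would be a reasonable extra step, though the abstract does not say whether one was used.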
