An Investigation of Deep-Learning Frameworks for Speaker Verification Antispoofing

In this study, we explore deep-learning approaches for spoofing detection in speaker verification. Most spoofing detection systems that have achieved recent success employ hand-crafted features built on prior knowledge of specific spoofing methods, which may limit their generalization to unseen spoofing attacks. We aim to investigate genuine-versus-spoofed discriminative ability at the back-end stage, utilizing recent advancements in deep-learning research. Alternative network architectures are explored for detecting spoofed speech. Based on this analysis, a novel spoofing detection system that combines convolutional neural networks (CNNs) and recurrent neural networks (RNNs) is proposed. In this framework, the CNN is treated as a convolutional feature extractor applied to the speech input. On top of the CNN-processed output, recurrent networks are employed to capture long-term dependencies across the time domain. Novel features, including the Teager energy operator critical band autocorrelation envelope, perceptual minimum variance distortionless response, and a more general spectrogram, are also investigated as inputs to the proposed deep-learning frameworks. Experiments using the ASVspoof 2015 Corpus show that the integrated CNN–RNN framework achieves state-of-the-art single-system performance. The addition of score-level fusion further improves system robustness. A detailed analysis shows that the proposed approach can potentially compensate for the problems caused by short-duration test utterances, which are also present in the evaluation corpus.
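The data flow described above (a convolutional feature extractor followed by a recurrent layer over time) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer sizes, the plain tanh recurrence, the max-pooling over frequency, the random untrained weights, and the names `conv_features`, `rnn_last_state`, `spoof_score`, and `init_params` are all assumptions made for illustration. The sketch only shows how a variable-length time-frequency input can be reduced to a single genuine-versus-spoofed score.

```python
import numpy as np


def conv_features(spec, kernels):
    """Valid 2-D cross-correlation + ReLU, then max-pool over frequency.

    spec:    (T, F) time-frequency input, e.g. a spectrogram.
    kernels: (K, kt, kf) bank of K filters (hypothetical sizes).
    Returns a (T - kt + 1, K) sequence of per-frame feature vectors.
    """
    T, F = spec.shape
    K, kt, kf = kernels.shape
    out = np.empty((T - kt + 1, K))
    for t in range(T - kt + 1):
        # Every frequency position of the patch that starts at frame t.
        patches = np.stack(
            [spec[t:t + kt, f:f + kf] for f in range(F - kf + 1)]
        )  # shape (F - kf + 1, kt, kf)
        resp = np.tensordot(patches, kernels, axes=([1, 2], [1, 2]))  # (F', K)
        out[t] = np.maximum(resp, 0.0).max(axis=0)
    return out


def rnn_last_state(seq, W_in, W_rec, b):
    """Plain tanh recurrence over time; returns the final hidden state.

    A simple stand-in (assumption) for whatever recurrent unit is used.
    """
    h = np.zeros(W_rec.shape[0])
    for x in seq:
        h = np.tanh(W_in @ x + W_rec @ h + b)
    return h


def spoof_score(spec, params):
    """Map one utterance of any length to a score in (0, 1)."""
    feats = conv_features(spec, params["kernels"])
    h = rnn_last_state(feats, params["W_in"], params["W_rec"], params["b"])
    logit = params["w_out"] @ h + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))


def init_params(rng, n_kernels=4, kt=3, kf=5, hidden=8):
    """Random, untrained weights; all sizes are illustrative assumptions."""
    return {
        "kernels": rng.normal(scale=0.1, size=(n_kernels, kt, kf)),
        "W_in": rng.normal(scale=0.1, size=(hidden, n_kernels)),
        "W_rec": rng.normal(scale=0.1, size=(hidden, hidden)),
        "b": np.zeros(hidden),
        "w_out": rng.normal(scale=0.1, size=hidden),
        "b_out": 0.0,
    }


rng = np.random.default_rng(0)
params = init_params(rng)
short_utt = rng.normal(size=(40, 64))   # 40 frames, 64 frequency bins
long_utt = rng.normal(size=(300, 64))   # 300 frames, same model
scores = [spoof_score(short_utt, params), spoof_score(long_utt, params)]
```

Because the recurrence consumes one frame at a time, the same parameters apply to utterances of any duration. With untrained weights the scores carry no meaning; a real system would learn all parameters from labeled genuine and spoofed speech.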
