A Gated Recurrent Convolutional Neural Network for Robust Spoofing Detection

Automatic speaker verification (ASV) systems are exposed to spoofing attacks that can compromise their security. Anti-spoofing techniques have mainly been studied in clean scenarios, and they have been shown to perform poorly in noisy environments. In this work, we aim to improve spoofing detection for ASV in both clean and noisy scenarios. To achieve this, we first propose using Gated Recurrent Convolutional Neural Networks (GRCNNs) as a deep feature extractor that robustly represents speech signals as utterance-level embeddings, which are later used by a back-end recognizer for the final genuine/spoofed classification. Then, to enhance the robustness of the system in noisy conditions, we propose signal-to-noise masks (SNMs) as new input features. These masks inform the anti-spoofing system about the time-frequency regions of the input spectral features that are most affected by noise and, hence, should be neglected when computing the embeddings. To evaluate our proposals, experiments were carried out on the clean and noisy versions of the ASVspoof 2015 corpus for detecting logical access attacks, as well as on the ASVspoof 2017 database for detecting replay attacks. Additional results are provided for the ASVspoof 2019 corpus, covering both logical and physical access scenarios. The experimental results show that our proposal clearly outperforms several well-known methods based on classical features, as well as other similar deep-feature-based systems, in both clean and noisy conditions.
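As a rough illustration of the idea behind a signal-to-noise mask, the sketch below computes a local SNR for every time-frequency bin and squashes it into a soft mask in [0, 1], where values near 0 mark noise-dominated bins. The abstract does not state how the SNMs are computed, so everything here is an assumption made for illustration: the function names, the dB floor, the logistic squashing and its slope, and the availability of separate speech and noise power spectrograms (an oracle setting; in practice the mask would have to be estimated from the noisy signal).

```python
import math


def local_snr_db(speech_power, noise_power, floor=1e-10):
    """Per-bin SNR in dB for two equally shaped power spectrograms.

    Both inputs are lists of frames, each frame a list of per-frequency
    power values. `floor` (an assumed value) avoids log(0) and division by 0.
    """
    snr = []
    for s_frame, n_frame in zip(speech_power, noise_power):
        snr.append([
            10.0 * math.log10(max(s, floor) / max(n, floor))
            for s, n in zip(s_frame, n_frame)
        ])
    return snr


def soft_mask(snr_db, slope=0.5):
    """Map local SNR (dB) to [0, 1] with a logistic function.

    Bins with SNR well above 0 dB approach 1 (speech-dominated);
    bins well below 0 dB approach 0 (noise-dominated).
    """
    return [
        [1.0 / (1.0 + math.exp(-slope * v)) for v in frame]
        for frame in snr_db
    ]


# Toy example: one frame with two frequency bins.
speech = [[1.0, 0.01]]   # first bin is speech-dominated
noise = [[0.01, 1.0]]    # second bin is noise-dominated
mask = soft_mask(local_snr_db(speech, noise))
```

In a setup like the one described, such a mask could be supplied to the network alongside the spectral features so that the embedding extractor can learn to down-weight unreliable regions.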
