End-to-end anti-spoofing with RawNet2

Spoofing countermeasures aim to protect automatic speaker verification systems from attempts to manipulate their reliability with the use of spoofed speech signals. While results from the most recent ASVspoof 2019 evaluation show great potential to detect most forms of attack, some continue to evade detection. This paper reports the first application of RawNet2 to anti-spoofing. RawNet2 ingests raw audio and has potential to learn cues that are not detectable using more traditional countermeasure solutions. We describe modifications made to the original RawNet2 architecture so that it can be applied to anti-spoofing. For A17 attacks, our RawNet2 systems results are the second-best reported, while the fusion of RawNet2 and baseline countermeasures gives the second-best results reported for the full ASVspoof 2019 logical access condition. Our results are reproducible with open source software.

[1]  Yoshua Bengio,et al.  Learning Speaker Representations with Mutual Information , 2018, INTERSPEECH.

[2]  Hye-jin Shim,et al.  Improved RawNet with Filter-wise Rescaling for Text-independent Speaker Verification using Raw Waveforms , 2020, ArXiv.

[3]  In-So Kweon,et al.  CBAM: Convolutional Block Attention Module , 2018, ECCV.

[4]  Nicholas W. D. Evans,et al.  A one-class classification approach to generalised speaker verification spoofing countermeasures using local binary patterns , 2013, 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS).

[5]  Hemlata Tak,et al.  An explainability study of the constant Q cepstral coefficient spoofing countermeasure for automatic speaker verification , 2020, Odyssey.

[6]  Muhammad Awais,et al.  Spoofing Attack Detection by Anomaly Detection , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Douglas A. Reynolds,et al.  Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[8]  Hemlata Tak,et al.  Spoofing Attack Detection using the Non-linear Fusion of Sub-band Classifiers , 2020, INTERSPEECH.

[9]  Sébastien Marcel,et al.  Towards Directly Modeling Raw Speech Signal for Speaker Verification Using CNNS , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Josef Kittler,et al.  Combining Multiple one-class Classifiers for Anomaly based Face Spoofing Attack Detection , 2019, 2019 International Conference on Biometrics (ICB).

[11]  Tetsushi Ohki,et al.  Efficient Spoofing Attack Detection against Unknown Sample using End-to-End Anomaly Detection , 2019, 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[12]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Tomi Kinnunen,et al.  ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection , 2019, INTERSPEECH.

[14]  Galina Lavrentyeva,et al.  STC Antispoofing Systems for the ASVspoof2019 Challenge , 2019, INTERSPEECH.

[15]  Joon Son Chung,et al.  Voxceleb: Large-scale speaker verification in the wild , 2020, Comput. Speech Lang..

[16]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[17]  Sébastien Le Maguer,et al.  ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech , 2019, Comput. Speech Lang..

[18]  Myung-Jae Kim,et al.  Advanced b-vector system based deep neural network as classifier for speaker verification , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Tomi Kinnunen,et al.  A comparison of features for synthetic speech detection , 2015, INTERSPEECH.

[20]  Niko Brümmer,et al.  The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF , 2013, ArXiv.

[21]  Hao Tang,et al.  Frame-Level Speaker Embeddings for Text-Independent Speaker Recognition and Analysis of End-to-End Model , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[22]  Koichi Shinoda,et al.  Attentive Statistics Pooling for Deep Speaker Embedding , 2018, INTERSPEECH.

[23]  Bob L. Sturm,et al.  Ensemble Models for Spoofing Detection in Automatic Speaker Verification , 2019, INTERSPEECH.

[24]  Yoshua Bengio,et al.  Speaker Recognition from Raw Waveform with SincNet , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[25]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Hsin-Min Wang,et al.  Speaker verification using kernel-based binary classifiers with binary operation derived features , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Hye-jin Shim,et al.  A Complete End-to-End Speaker Verification System Using Deep Neural Networks: From Raw Signals to Verification Result , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28]  Hye-jin Shim,et al.  RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification , 2019, INTERSPEECH.

[29]  Kong-Aik Lee,et al.  t-DCF: a Detection Cost Function for the Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification , 2018, Odyssey.