A Kernel Density Estimation Based Loss Function and its Application to ASV-Spoofing Detection

Biometric systems are exposed to spoofing attacks that may compromise their security, and voice biometrics, also known as automatic speaker verification (ASV), is no exception. Replay, synthesis and voice conversion attacks can cause false acceptances, which anti-spoofing systems aim to prevent by detecting the attacks. Recently, deep neural networks (DNNs) that extract embedding vectors have outperformed conventional systems in both ASV and anti-spoofing tasks. In this work, we develop a new type of loss function for training DNNs, based on kernel density estimation (KDE) techniques. The proposed loss functions estimate the probability density function (pdf) of every training class in each mini-batch, and compute a log-likelihood matrix between the embedding vectors and the pdfs of all training classes within the mini-batch in order to obtain the KDE-based loss. To evaluate our proposal for spoofing detection, experiments were carried out on the recent ASVspoof 2019 corpus, including both the logical and physical access scenarios. The experimental results show that a DNN-based anti-spoofing system trained with our proposed loss functions clearly outperforms the same system trained with other well-known loss functions. Moreover, the results also show that the proposed loss functions are effective for different types of neural network architectures.
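The mini-batch computation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the kernel, the bandwidth, or how the log-likelihood matrix is reduced to a scalar loss, so the isotropic Gaussian kernel, the fixed bandwidth `h`, the softmax-style cross-entropy reduction, and all function names below are assumptions made for illustration only. Each embedding is also included in the density estimate of its own class, which is a simplification.

```python
import math


def gaussian_log_kernel(x, y, h):
    """Log of an isotropic Gaussian kernel with bandwidth h (assumed kernel)."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return -0.5 * sq_dist / h ** 2 - 0.5 * d * math.log(2.0 * math.pi * h ** 2)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood_matrix(embeddings, labels, h):
    """Entry [i][c] is the log KDE density of class c evaluated at embedding i.

    The class pdfs are estimated only from the embeddings in this mini-batch.
    """
    classes = sorted(set(labels))
    matrix = []
    for x in embeddings:
        row = []
        for c in classes:
            members = [e for e, y in zip(embeddings, labels) if y == c]
            log_k = [gaussian_log_kernel(x, m, h) for m in members]
            # log of the average kernel value over the class members
            row.append(logsumexp(log_k) - math.log(len(members)))
        matrix.append(row)
    return classes, matrix


def kde_loss(embeddings, labels, h=1.0):
    """Assumed reduction: cross-entropy over the rows of the log-likelihood matrix."""
    classes, matrix = log_likelihood_matrix(embeddings, labels, h)
    total = 0.0
    for row, y in zip(matrix, labels):
        # negative log of the normalized likelihood of the true class
        total += logsumexp(row) - row[classes.index(y)]
    return total / len(labels)


# Toy mini-batch: two classes (e.g. 0 = bona fide, 1 = spoof) in a 2-D embedding space.
separated = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
mixed = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]]
batch_labels = [0, 0, 1, 1]
print(kde_loss(separated, batch_labels), kde_loss(mixed, batch_labels))
```

Under these assumptions, the loss is lower when the embeddings of each class form a compact, well-separated cluster than when the classes overlap, which is the behavior a discriminative embedding loss should reward. In actual training the same computation would be written with a framework that supports automatic differentiation so that gradients flow back to the network.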
