Optimizing Tandem Speaker Verification and Anti-Spoofing Systems

Because automatic speaker verification (ASV) systems are vulnerable to spoofing attacks, they are typically paired with spoofing countermeasure (CM) systems to improve security. For example, the CM can first determine whether the input is genuine human speech, and the ASV can then determine whether this speech matches the claimed speaker’s identity. The performance of such a tandem system can be measured with the tandem detection cost function (t-DCF). However, ASV and CM systems are usually trained separately, using different metrics and data, so their combined performance is never optimized directly. In this work, we propose to optimize the tandem system directly by creating a differentiable version of the t-DCF and by employing techniques from reinforcement learning. The results indicate that both approaches outperform fine-tuning, with our method providing a 20% relative improvement in t-DCF on the ASVspoof 2019 dataset in a constrained setting.
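The cascade described above, and the general idea of making a tandem cost differentiable, can be illustrated with a minimal sketch. It is not the method proposed in this work and it is not the t-DCF itself: the function names, the thresholds at zero, the `sharpness` parameter, the cost weights, and the trial labels are all assumptions chosen for illustration. The hard decision accepts a trial only when both the CM and the ASV accept it; the soft version replaces each threshold test with a sigmoid so that the resulting cost varies smoothly with the scores.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tandem_accept(cm_score, asv_score, cm_threshold=0.0, asv_threshold=0.0):
    """Hard cascade: accept only if the CM and the ASV both accept."""
    return cm_score > cm_threshold and asv_score > asv_threshold


def soft_tandem_accept(cm_score, asv_score, cm_threshold=0.0,
                       asv_threshold=0.0, sharpness=5.0):
    """Smooth stand-in for tandem_accept: a product of two sigmoids.

    `sharpness` is a hypothetical knob; larger values approach the
    hard decision.
    """
    cm_pass = sigmoid(sharpness * (cm_score - cm_threshold))
    asv_pass = sigmoid(sharpness * (asv_score - asv_threshold))
    return cm_pass * asv_pass


def soft_tandem_cost(trials, cost_miss=1.0, cost_fa=10.0, sharpness=5.0):
    """Simplified weighted cost over (cm_score, asv_score, label) trials.

    Labels are assumed to be "target", "nontarget", or "spoof". Only
    targets should be accepted; the weights are illustrative and are
    not the official t-DCF parameters.
    """
    miss, n_target, false_accept, n_other = 0.0, 0, 0.0, 0
    for cm_score, asv_score, label in trials:
        p_accept = soft_tandem_accept(cm_score, asv_score, sharpness=sharpness)
        if label == "target":
            miss += 1.0 - p_accept
            n_target += 1
        else:
            false_accept += p_accept
            n_other += 1
    miss_rate = miss / max(n_target, 1)
    fa_rate = false_accept / max(n_other, 1)
    return cost_miss * miss_rate + cost_fa * fa_rate


if __name__ == "__main__":
    trials = [(3.0, 3.0, "target"), (3.0, -3.0, "nontarget"), (-3.0, 3.0, "spoof")]
    print(soft_tandem_cost(trials))
```

Because the soft cost is a smooth function of both scores, gradients could in principle flow back into both systems at once, which is the motivation for a differentiable tandem metric.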
