An explainability study of the constant Q cepstral coefficient spoofing countermeasure for automatic speaker verification

Anti-spoofing for automatic speaker verification is now a well-established area of research, with three competitive challenges having been held in the last six years. A great deal of research effort over this time has been invested in the development of front-end representations tailored to the spoofing detection task. One such approach, known as constant Q cepstral coefficients (CQCCs), has been shown to be especially effective in detecting attacks implemented with a unit-selection-based speech synthesis algorithm. Despite this success, CQCCs largely fail to detect other forms of spoofing attack, for which more traditional front-end representations give substantially better results. Similar differences were also observed in the most recent 2019 edition of the ASVspoof challenge series. This paper reports our attempts to help explain these observations. The explanation is shown to lie in the level of attention paid by each front-end to different sub-band components of the spectrum. Thus far, surprisingly little has been learned about what artefacts are being detected by spoofing countermeasures. Our work hence aims to shed light upon the signal- or spectrum-level artefacts that serve to distinguish different forms of spoofing attack from genuine, bona fide speech. With a better understanding of these artefacts, we will be better positioned to design more reliable countermeasures.
