Paper information - Spoofing Speaker Verification Systems with Deep Multi-speaker Text-to-speech Synthesis

This paper proposes a deep multi-speaker text-to-speech (TTS) model for spoofing speaker verification (SV) systems. The model uses one network to synthesize time-downsampled mel-spectrograms from text input and a second network to convert them to linear-frequency spectrograms, which are then converted to time-domain waveforms with the Griffin-Lim algorithm. The two networks are trained separately under the generative adversarial network (GAN) framework. Experiments on two state-of-the-art SV systems (i-vectors and Google's GE2E) show that the proposed system spoofs both with a high success rate. Experiments on anti-spoofing systems (i.e., binary classifiers that discriminate real from synthetic speech) also show a high spoof success rate when the structure of those anti-spoofing systems is exposed to the proposed TTS system.
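
To make the notion of a "spoof success rate" concrete, the following is a minimal sketch of how such a rate could be computed against an embedding-based verifier. It assumes that the SV system scores a trial by cosine similarity between an enrolled speaker embedding and a test embedding and accepts the trial when the score reaches a fixed threshold. The abstract does not state the scoring rule or threshold used in the paper, so the function names, the toy embeddings, and the threshold value below are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_accepted(enrolled, test, threshold):
    """Hypothetical SV decision: accept when the score reaches the threshold."""
    return cosine_similarity(enrolled, test) >= threshold


def spoof_success_rate(enrolled, synthetic_embeddings, threshold):
    """Fraction of synthetic utterances accepted as the target speaker."""
    if not synthetic_embeddings:
        raise ValueError("need at least one synthetic embedding")
    accepted = sum(
        is_accepted(enrolled, emb, threshold) for emb in synthetic_embeddings
    )
    return accepted / len(synthetic_embeddings)


# Toy 3-dimensional embeddings (made-up numbers, for illustration only).
target = [1.0, 0.0, 0.0]
synthetic = [
    [0.9, 0.1, 0.0],  # close to the target speaker
    [0.8, 0.6, 0.0],  # moderately similar
    [0.0, 1.0, 0.0],  # orthogonal to the target
    [1.0, 0.0, 0.1],  # close to the target speaker
]

# Assumed operating threshold; a real system would tune this on held-out data.
rate = spoof_success_rate(target, synthetic, threshold=0.85)
print(f"spoof success rate: {rate:.2f}")
```

In this toy setup two of the four synthetic embeddings clear the threshold. A real evaluation would extract embeddings from synthesized audio with the SV system under attack and would typically fix the threshold at that system's operating point on genuine trials.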

Zhiyao Duan | Mingrui Yuan
