Towards End-to-End Synthetic Speech Detection

The constant Q transform (CQT) has been shown to be one of the most effective pre-transforms of the speech signal for synthetic speech detection. It is followed either by hand-crafted (subband) constant Q cepstral coefficient (CQCC) feature extraction and a back-end binary classifier, or directly by a deep neural network (DNN) that performs further feature extraction and classification. Despite the rich literature on such pipelines, we show in this paper that the pre-transform and hand-crafted features can simply be replaced by end-to-end DNNs. Specifically, we experimentally verify that, using only standard components, a light-weight neural network can outperform the state-of-the-art methods for the ASVspoof2019 challenge. The proposed model, termed Time-domain Synthetic Speech Detection Net (TSSDNet), has ResNet- or Inception-style structures. We further demonstrate that the proposed models have attractive generalization capability: trained on ASVspoof2019, they can achieve promising detection performance when tested on the disjoint ASVspoof2015 dataset, significantly better than existing cross-dataset results. This paper reveals the great potential of end-to-end DNNs for synthetic speech detection without hand-crafted features.
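To make the end-to-end idea concrete, the following is a minimal NumPy sketch of a time-domain classifier with one ResNet-style block. It is not the TSSDNet architecture: the layer count, channel width, kernel sizes, stride, global max pooling, and random (untrained) weights are all assumptions chosen for illustration, and every function name here is hypothetical. It only shows how a raw waveform can be mapped to two class probabilities without a CQT or CQCC front end.

```python
import numpy as np

def conv1d(x, w, b, stride=1):
    """Same-padded 1-D convolution (odd kernel sizes only).
    x: (C_in, T), w: (C_out, C_in, K), b: (C_out,)."""
    k = w.shape[2]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # windows has shape (C_in, T, K): one length-K slice per time step
    windows = np.lib.stride_tricks.sliding_window_view(xp, k, axis=1)
    y = np.einsum("ctk,ock->ot", windows, w) + b[:, None]
    return y[:, ::stride]

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, p):
    """Two convolutions plus an identity skip connection (ResNet-style)."""
    h = relu(conv1d(x, p["w1"], p["b1"]))
    h = conv1d(h, p["w2"], p["b2"])
    return relu(h + x)

def init_params(rng, channels=8, stem_kernel=7):
    """Random, untrained weights; all sizes are illustrative assumptions."""
    def w(shape):
        return rng.normal(0.0, 0.1, size=shape)
    return {
        "stem_w": w((channels, 1, stem_kernel)),
        "stem_b": np.zeros(channels),
        "block": {
            "w1": w((channels, channels, 3)), "b1": np.zeros(channels),
            "w2": w((channels, channels, 3)), "b2": np.zeros(channels),
        },
        "fc_w": w((2, channels)),
        "fc_b": np.zeros(2),
    }

def forward(waveform, p):
    """Raw waveform (T,) -> probabilities over two classes."""
    x = waveform[None, :]                       # add channel axis: (1, T)
    x = relu(conv1d(x, p["stem_w"], p["stem_b"], stride=4))
    x = residual_block(x, p["block"])
    pooled = x.max(axis=1)                      # global max pooling over time
    logits = p["fc_w"] @ pooled + p["fc_b"]
    e = np.exp(logits - logits.max())           # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
params = init_params(rng)
probs = forward(rng.standard_normal(16000), params)   # e.g. 1 s at 16 kHz
```

Because the time axis is collapsed by global pooling before the linear layer, the same weights accept waveforms of any length, which is one practical reason a time-domain model can skip a fixed-size spectral front end.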
