FoR: A Dataset for Synthetic Speech Detection

With advances in deep learning and related techniques, synthetic speech is getting closer to a natural-sounding voice. Some state-of-the-art technologies achieve such a high level of naturalness that even humans have difficulty distinguishing real speech from computer-generated speech. Moreover, these technologies allow a person to train a speech synthesizer on a target voice, creating a model that can reproduce someone’s voice with high fidelity. In this paper, we introduce the FoR Dataset, which contains more than 198,000 utterances from the latest deep-learning speech synthesizers as well as real speech. This dataset can be used as a base for several studies in speech synthesis and synthetic speech detection. Because of its large number of utterances, it is well suited to machine learning studies: it is large enough to train even complex deep learning models without overfitting. We present several experiments using this dataset, including a deep learning classifier that reached up to 99.96% accuracy in synthetic speech detection.
