Synthetic speech detection through short-term and long-term prediction traces

Several methods for synthetic speech generation have been developed over the years. With the great technological advances brought by deep learning, many novel synthesis techniques achieving incredibly realistic results have recently been proposed. As these methods generate convincing fake human voices, they can be used maliciously to harm today's society (e.g., people impersonation, fake-news spreading, opinion manipulation). For this reason, the ability to detect whether a speech recording is synthetic or pristine is becoming an urgent necessity. In this work, we develop a synthetic speech detector. It takes an audio recording as input, extracts a set of hand-crafted features motivated by the speech-processing literature, and classifies them in either a closed-set or an open-set scenario. The proposed detector is validated on a publicly available dataset covering 17 synthetic speech generation algorithms, ranging from old-fashioned vocoders to modern deep-learning solutions. Results show that the proposed method outperforms recently proposed detectors from the forensics literature.
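The hand-crafted features referenced above are motivated by short-term and long-term prediction analysis of the speech signal. As an illustration only (the exact feature set is not specified here), the sketch below computes a simple short-term prediction trace: the per-frame normalized linear-prediction residual energy, obtained via the autocorrelation method and the Levinson-Durbin recursion. The frame length, hop size, and prediction order are hypothetical choices, not values from the paper.

```python
import numpy as np

def lpc_residual_features(signal, order=12, frame_len=512, hop=256):
    """Per-frame short-term LPC analysis (autocorrelation method +
    Levinson-Durbin). Returns the normalized prediction-error energy
    of each frame, a simple hand-crafted prediction-trace feature."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        # Biased autocorrelation estimate up to lag `order`
        r = np.array([np.dot(frame[:frame_len - k], frame[k:])
                      for k in range(order + 1)])
        if r[0] == 0.0:  # silent frame: nothing to predict
            feats.append(0.0)
            continue
        # Levinson-Durbin recursion for the LPC coefficients a[0..order]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err
            a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
            err *= (1.0 - k * k)
        # Residual energy normalized by frame energy (in [0, 1]);
        # highly predictable speech yields values near 0
        feats.append(err / r[0])
    return np.array(feats)
```

Such per-frame traces could then be aggregated (e.g., mean and variance over an utterance) and fed to any off-the-shelf closed-set or open-set classifier; a long-term (pitch-lag) predictor would follow the same pattern with a single lag searched over the pitch range.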
