How Deep Are the Fakes? Focusing on Audio Deepfake: A Survey

A deepfake is content that is synthetically generated or manipulated using artificial intelligence (AI) methods in order to be passed off as real; it can include audio, video, image, and text synthesis. This survey takes a different perspective from existing survey papers, which mostly focus on video and image deepfakes only. It evaluates generation and detection methods across the different deepfake categories, but focuses mainly on audio deepfakes, which are overlooked in most existing surveys. This paper’s most important contribution is to critically analyze and provide a unique source of audio deepfake research, mostly ranging from 2016 to 2020. To the best of our knowledge, this is the first survey focusing on audio deepfakes in English. The survey provides readers with a summary of 1) the different deepfake categories; 2) how they can be created and detected; 3) the most recent trends in this domain and the shortcomings of detection methods; and 4) audio deepfakes and how they are created and detected in more detail, which is the main focus of this paper. We found that Generative Adversarial Networks (GAN), Convolutional Neural Networks (CNN), and Deep Neural Networks (DNN) are common ways of creating and detecting deepfakes. In our evaluation of over 140 methods, we found that most of the focus is on video deepfakes, and in particular on their generation. For text deepfakes, we found more generation methods but very few robust detection methods, including for fake news detection, which has become a controversial area of research because of its potentially heavy overlap with human-generated fake content. This paper is an abbreviated version of the full survey and reveals a clear need for research on audio deepfakes, particularly their detection.
