DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake Voices

With recent advances in voice synthesis, AI-synthesized fake voices have become indistinguishable to human ears and are widely applied to produce realistic and natural DeepFakes, posing real threats to our society. However, effective and robust detectors for synthesized fake voices are still in their infancy and are not ready to fully tackle this emerging threat. In this paper, we devise a novel approach, named DeepSonar, based on monitoring the neuron behaviors of a speaker recognition (SR) system, i.e., a deep neural network (DNN), to discern AI-synthesized fake voices. Layer-wise neuron behaviors provide important insight for meticulously capturing the differences among inputs, and are widely employed for building safe, robust, and interpretable DNNs. In this work, we leverage the power of layer-wise neuron activation patterns, conjecturing that they can capture the subtle differences between real and AI-synthesized fake voices and thereby provide a cleaner signal to classifiers than raw inputs. Experiments are conducted on three datasets (including commercial products from Google, Baidu, etc.) containing both English and Chinese languages to corroborate the high detection rates (98.1% average accuracy) and low false alarm rates (about 2% error rate) of DeepSonar in discerning fake voices. Furthermore, extensive experimental results also demonstrate its robustness against manipulation attacks (e.g., voice conversion and additive real-world noises). Our work further offers a new insight into adopting neuron behaviors for effective and robust AI-aided multimedia fake forensics as an inside-out approach, instead of being motivated and swayed by the various artifacts introduced in synthesizing fakes.
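
The idea of turning layer-wise neuron activations into classifier features can be illustrated with a minimal sketch. This is not the DeepSonar implementation: the tiny randomly weighted network, the helper names (`make_layer`, `forward_with_activations`, `layerwise_features`), the fixed activation threshold, and the four-value input are all assumptions made purely for illustration. In practice the network would be a trained SR model and the input a real audio representation.

```python
import random

def make_layer(n_in, n_out, rng):
    """Hypothetical dense layer: n_out rows of n_in random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def forward_with_activations(layers, x):
    """Run x through ReLU layers and record every layer's activations."""
    activations = []
    for weights in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
        activations.append(x)
    return activations

def layerwise_features(layers, x, threshold=0.0):
    """One feature per layer: fraction of neurons activated above threshold."""
    return [
        sum(1 for a in acts if a > threshold) / len(acts)
        for acts in forward_with_activations(layers, x)
    ]

rng = random.Random(0)  # fixed seed so the toy network is reproducible
# Stand-in for an SR model: three layers of sizes 4 -> 6 -> 5 -> 3 (assumed).
layers = [make_layer(4, 6, rng), make_layer(6, 5, rng), make_layer(5, 3, rng)]

voice_frame = [0.2, -0.1, 0.4, 0.3]  # placeholder input, not real audio
features = layerwise_features(layers, voice_frame)
print(features)  # one value in [0, 1] per layer
```

A feature vector of this kind, computed for both real and synthesized voices, could then be passed to a simple binary classifier. The abstract's conjecture is that such activation patterns separate the two classes more cleanly than raw inputs do.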
