Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed as the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to dis-harmony between the two modalities, e.g., loss of lip-sync, unnatural facial and lip movements, etc. MDS is computed as the mean aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT Datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.

[1]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[2]  Joon Son Chung,et al.  Out of Time: Automated Lip Sync in the Wild , 2016, ACCV Workshops.

[3]  Tat-Seng Chua,et al.  Automated localization of affective objects and actions in images via caption text-cum-eye gaze analysis , 2009, ACM Multimedia.

[4]  Siwei Lyu,et al.  Exposing DeepFake Videos By Detecting Face Warping Artifacts , 2018, CVPR Workshops.

[5]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Héctor M. Pérez Meana,et al.  Speaker recognition using Mel frequency Cepstral Coefficients (MFCC) and Vector quantization (VQ) techniques , 2012, CONIELECOMP 2012, 22nd International Conference on Electrical Communications and Computers.

[7]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[8]  Justus Thies,et al.  Face2Face: real-time face capture and reenactment of RGB videos , 2019, Commun. ACM.

[9]  Wael Abd-Almageed,et al.  BusterNet: Detecting Copy-Move Image Forgery with Source/Target Localization , 2018, ECCV.

[10]  Brian C. Lovell,et al.  Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference , 2009, ICB.

[11]  Dinesh Manocha,et al.  Emotions Don't Lie: An Audio-Visual Deepfake Detection Method using Affective Cues , 2020, ACM Multimedia.

[12]  Junichi Yamagishi,et al.  Multi-task Learning for Detecting and Segmenting Manipulated Facial Images and Videos , 2019, 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS).

[13]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[14]  Hynek Hermansky,et al.  Automatic Speech Recognition: an Auditory Perspective , 2004 .

[15]  Junichi Yamagishi,et al.  MesoNet: a Compact Facial Video Forgery Detection Network , 2018, 2018 IEEE International Workshop on Information Forensics and Security (WIFS).

[16]  L. Floridi Artificial Intelligence, Deepfakes and a Future of Ectypes , 2018, Philosophy & Technology.

[17]  Aya Yadlin‐Segal,et al.  Whose dystopia is it anyway? Deepfakes and social media regulation , 2020, Convergence: The International Journal of Research into New Media Technologies.

[18]  Christian Riess,et al.  Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations , 2019, 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW).

[19]  Edward J. Delp,et al.  Deepfake Video Detection Using Recurrent Neural Networks , 2018, 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[20]  T. Grimes,et al.  Mild auditory-visual dissonance in television news may exceed viewer attentional capacity , 1991 .

[21]  Cristian Canton-Ferrer,et al.  The Deepfake Detection Challenge (DFDC) Preview Dataset , 2019, ArXiv.

[22]  Sébastien Marcel,et al.  DeepFakes: a New Threat to Face Recognition? Assessment and Detection , 2018, ArXiv.

[23]  Shifeng Zhang,et al.  S^3FD: Single Shot Scale-Invariant Face Detector , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[24]  Vineet Gandhi,et al.  GAZED– Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings , 2020, CHI.

[25]  Hao Li,et al.  Protecting World Leaders Against Deep Fakes , 2019, CVPR Workshops.

[26]  Junichi Yamagishi,et al.  Capsule-forensics: Using Capsule Networks to Detect Forged Images and Videos , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Dinesh Manocha,et al.  Emotions Don't Lie: A Deepfake Detection Method using Audio-Visual Affective Cues , 2020, ArXiv.

[28]  Marian Stewart Bartlett,et al.  Classification and weakly supervised pain localization using multiple segment representation , 2014, Image Vis. Comput..

[29]  Siwei Lyu,et al.  In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking , 2018, 2018 IEEE International Workshop on Information Forensics and Security (WIFS).

[30]  Andreas Rössler,et al.  FaceForensics++: Learning to Detect Manipulated Facial Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Larry S. Davis,et al.  Two-Stream Neural Networks for Tampered Face Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[32]  Xin Yang,et al.  Exposing Deep Fakes Using Inconsistent Head Poses , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[33]  Abhishek Das,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Mohan S. Kankanhalli,et al.  EEG-based Evaluation of Cognitive Workload Induced by Acoustic Parameters for Data Sonification , 2018, ICMI.