Emotions Don't Lie: An Audio-Visual Deepfake Detection Method using Affective Cues

We present a learning-based method for detecting deepfake multimedia content, i.e., for classifying an input video as "real" or "fake". To maximize the information available for learning, we extract and analyze the similarity between the audio and visual modalities from within the same video. In addition, we extract and compare affective cues corresponding to perceived emotion from the two modalities to support this inference. We propose a deep learning network inspired by the Siamese architecture and the triplet loss. To validate our model, we report the AUC metric on two large-scale deepfake detection datasets, DeepFake-TIMIT (DF-TIMIT) and DFDC. We compare our approach with several state-of-the-art (SOTA) deepfake detection methods, reporting a per-video AUC of 84.4% on DFDC and 96.6% on DF-TIMIT. To the best of our knowledge, ours is the first approach that simultaneously exploits the audio and visual modalities, together with the perceived emotions extracted from both, for deepfake detection.
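To make the training objective concrete, below is a minimal sketch of a Siamese-style setup with a triplet loss over paired audio and visual embeddings, in the spirit of the abstract. It assumes pre-extracted per-modality features; the module names (`ModalityEncoder`, `training_step`, `score_video`), feature dimensions, margin, and distance threshold are all illustrative assumptions, not the authors' implementation.

```python
# Sketch only: Siamese-style encoders + triplet loss for audio-visual
# deepfake detection. All dimensions and hyperparameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Maps pre-extracted per-modality features (e.g., face or speech
    descriptors) into a shared embedding space."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-norm embeddings so distances are comparable across modalities.
        return F.normalize(self.net(x), dim=-1)

# One encoder per modality (input sizes are assumptions). For a real
# video, audio and visual embeddings should lie close together; for a
# fake video, the triplet loss pushes them apart.
audio_enc = ModalityEncoder(in_dim=512)
video_enc = ModalityEncoder(in_dim=1024)
triplet = nn.TripletMarginLoss(margin=0.5)  # margin is an assumption

def training_step(real_audio, real_video, fake_video):
    """Anchor: real audio; positive: the matching real video;
    negative: the corresponding fake video."""
    a = audio_enc(real_audio)
    p = video_enc(real_video)
    n = video_enc(fake_video)
    return triplet(a, p, n)

def score_video(audio_feats, video_feats, threshold=0.7):
    """At test time, a large audio-visual embedding distance suggests
    a fake; the threshold is an assumed hyperparameter."""
    d = torch.norm(audio_enc(audio_feats) - video_enc(video_feats), dim=-1)
    return d > threshold  # True -> predicted fake
```

In this reading, the same distance-based comparison can be applied to perceived-emotion embeddings from the two modalities, with the per-video prediction combining both similarity cues.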
