Emotions Don't Lie: A Deepfake Detection Method using Audio-Visual Affective Cues

We present a learning-based multimodal method for detecting real and deepfake videos. To maximize the information available for learning, we extract and analyze the similarity between the audio and visual modalities within the same video. In addition, we extract and compare affective cues corresponding to perceived emotion from the two modalities to infer whether the input video is "real" or "fake". We propose a deep learning network inspired by the Siamese architecture and the triplet loss. To validate our model, we report the AUC metric on two large-scale audio-visual deepfake detection datasets, DeepFake-TIMIT (DF-TIMIT) and DFDC. We compare our approach with several state-of-the-art (SOTA) deepfake detection methods and report per-video AUCs of 84.4% on DFDC and 96.6% on DF-TIMIT.
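
To illustrate the core idea of a Siamese-style, triplet-loss comparison between the two modalities, here is a minimal PyTorch sketch. The branch sizes, feature dimensions, margin, and negative-sampling strategy are illustrative assumptions, not the paper's exact configuration; only the general structure (two modality-specific branches mapped to a shared embedding space and trained with a triplet objective) reflects the described approach.

```python
# Minimal sketch (assumptions: 300-d visual features, 200-d audio features,
# 128-d shared embedding, margin 0.5 -- all illustrative, not the paper's values).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualSiamese(nn.Module):
    """Two-branch (Siamese-style) network mapping per-video face and speech
    features into a shared embedding space."""
    def __init__(self, face_dim=300, audio_dim=200, embed_dim=128):
        super().__init__()
        self.face_branch = nn.Sequential(
            nn.Linear(face_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, face_feats, audio_feats):
        # L2-normalize so distances between modalities are comparable.
        f = F.normalize(self.face_branch(face_feats), dim=-1)
        a = F.normalize(self.audio_branch(audio_feats), dim=-1)
        return f, a

# Triplet-style objective: for a real video, its own audio embedding acts as
# the positive, while audio from a fake/mismatched video acts as the negative.
triplet_loss = nn.TripletMarginLoss(margin=0.5)

model = AudioVisualSiamese()
face_real = torch.randn(8, 300)   # anchor: visual features of real videos
audio_real = torch.randn(8, 200)  # positive: matching audio of the same videos
audio_fake = torch.randn(8, 200)  # negative: audio from fake videos

f_emb, a_pos = model(face_real, audio_real)
_, a_neg = model(face_real, audio_fake)
loss = triplet_loss(f_emb, a_pos, a_neg)
loss.backward()
```

At inference time, such a model would score a video by the distance between its audio and visual embeddings, flagging it as fake when the cross-modal mismatch exceeds a threshold learned or tuned on the training data.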
