Do Deepfakes Feel Emotions? A Semantic Approach to Detecting Deepfakes Via Emotional Inconsistencies

Recent advances in deep learning and computer vision have spawned a new class of media forgeries known as deepfakes, which typically consist of artificially generated human faces or voices. The creation and distribution of deepfakes raise many legal and ethical concerns, making the ability to distinguish deepfakes from authentic media vital. While deepfakes can produce plausible video and audio, they may struggle to generate content that is consistent in terms of high-level semantic features, such as emotions. Unnatural displays of emotion, measured by features such as valence and arousal, can therefore provide significant evidence that a video has been synthesized. In this paper, we propose a novel method for detecting deepfakes of a human speaker using the emotion predicted from the speaker's face and voice. The proposed technique leverages Long Short-Term Memory (LSTM) networks that predict emotion from audio and video Low-Level Descriptors (LLDs). The predicted emotion over time is then used to classify videos as authentic or deepfake by an additional supervised classifier.
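The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the LSTM emotion predictor is replaced here by a simple per-frame linear map with hypothetical trained parameters, and the supervised deepfake classifier by a linear score, so that only the overall data flow (LLDs → valence/arousal time series → inconsistency features → authentic/fake decision) is shown.

```python
import numpy as np

def predict_emotion(llds, W, b):
    # Stand-in for the LSTM emotion predictor: maps per-frame
    # Low-Level Descriptors of shape (T, D) to a (valence, arousal)
    # trajectory of shape (T, 2). W (D, 2) and b (2,) are hypothetical
    # trained parameters, not from the paper.
    return np.tanh(llds @ W + b)  # values in [-1, 1]

def emotion_features(va_audio, va_video):
    # Summary statistics over the audio- and video-derived
    # (valence, arousal) time series, capturing emotional
    # inconsistency between the two modalities over time.
    diff = va_audio - va_video
    return np.array([
        diff.mean(),                  # mean audio/video disagreement
        np.abs(diff).mean(),          # magnitude of disagreement
        va_audio.std(axis=0).mean(),  # volatility of audio emotion
        va_video.std(axis=0).mean(),  # volatility of video emotion
    ])

def classify(features, w, threshold=0.0):
    # Linear stand-in for the supervised classifier: a large
    # emotional-inconsistency score is flagged as a deepfake.
    return "fake" if features @ w > threshold else "real"
```

In the actual method, `predict_emotion` would be two trained LSTMs (one per modality) consuming openSMILE-style audio LLDs and facial LLDs, and `classify` a trained supervised model; the sketch only fixes the interfaces between the stages.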
