Video Transformer for Deepfake Detection with Incremental Learning

Deepfake face forgeries are widespread on the internet, raising severe societal concerns. In this paper, we propose a novel video transformer with incremental learning for detecting deepfake videos. To better align the input face images, we use a 3D face reconstruction method to generate a UV texture map from a single input face image. The aligned face image also provides pose, eye-blink, and mouth-movement information that cannot be perceived in the UV texture image, so we use both the face images and their UV texture maps to extract image features. We present an incremental learning strategy that fine-tunes the proposed model on a smaller amount of data and achieves better deepfake detection performance. Comprehensive experiments on various public deepfake datasets demonstrate that the proposed video transformer with incremental learning achieves state-of-the-art performance on the deepfake video detection task, with enhanced feature learning from sequential data.
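
The two-input idea in the abstract can be illustrated with a minimal sketch. It is not the authors' architecture. It assumes that per-frame features from the face image and from its UV texture map are fused by simple concatenation, and that the resulting frame tokens are mixed by one scaled dot-product self-attention step with identity projections. All function names and the toy feature values are hypothetical.

```python
import math

def fuse_frame_features(face_feat, uv_feat):
    """Concatenate face-image and UV-texture features of one frame (assumed fusion)."""
    return list(face_feat) + list(uv_feat)

def clip_tokens(face_feats, uv_feats):
    """Build one token per frame from the two aligned feature sequences."""
    if len(face_feats) != len(uv_feats):
        raise ValueError("face and UV feature sequences must have the same length")
    return [fuse_frame_features(f, u) for f, u in zip(face_feats, uv_feats)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections (toy)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output token is a convex combination of all frame tokens.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

# Toy clip of three frames: 2-D face features and 1-D UV-texture features.
face_feats = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1]]
uv_feats = [[0.5], [0.4], [0.6]]

tokens = clip_tokens(face_feats, uv_feats)
attended = self_attention(tokens)
# Mean-pool over frames to get a single clip-level embedding.
clip_embedding = [sum(col) / len(attended) for col in zip(*attended)]
```

In a real model the features would come from learned backbones, the attention would use learned projections and multiple layers, and a classification head would map the clip embedding to a real/fake score. None of those details are specified in the abstract.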
