M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

The widespread dissemination of forged images generated by Deepfake techniques poses a serious threat to the trustworthiness of digital information. This demands effective approaches that can detect perceptually convincing Deepfakes generated by advanced manipulation techniques. Most existing approaches combat Deepfakes with deep neural networks that map the input image to a binary prediction without capturing the consistency among different pixels. In this paper, we aim to capture subtle manipulation artifacts at different scales for Deepfake detection. We achieve this with transformer models, which have recently demonstrated superior performance in modeling dependencies between pixels for a variety of recognition tasks in computer vision. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which uses a multi-scale transformer that operates on patches of different sizes to detect local inconsistencies at different spatial levels. To improve the detection results and enhance the robustness of our method to image compression, M2TR also takes frequency information as input, which is combined with RGB features using a cross-modality fusion module. Developing and evaluating Deepfake detection methods requires large-scale datasets. However, we observe that samples in existing benchmarks contain severe artifacts and lack diversity. This motivates us to introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 Deepfake videos generated by state-of-the-art face swapping and facial reenactment methods. We conduct extensive experiments on three Deepfake datasets to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods.
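The abstract names three ingredients: attention over patches of different sizes, a frequency-domain view of the input, and a fusion step that combines frequency and RGB features. The NumPy sketch below illustrates one plausible reading of those ideas; it is not the M2TR implementation. Every function name, the identity query/key/value projections, the averaging across scales, the use of a 2D FFT log-magnitude as the frequency representation, and the residual cross-attention fusion are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def split_into_patches(feature_map, patch_size):
    """Split an (H, W, C) map into non-overlapping, flattened patches."""
    h, w, c = feature_map.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = feature_map.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return x.reshape(gh * gw, patch_size * patch_size * c)


def merge_patches(tokens, patch_size, h, w, c):
    """Inverse of split_into_patches: rebuild the (H, W, C) map."""
    gh, gw = h // patch_size, w // patch_size
    x = tokens.reshape(gh, gw, patch_size, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, p, gw, p, c)
    return x.reshape(h, w, c)


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: identity projections for queries, keys, and values,
    where a trained model would use learned projections.
    """
    d = tokens.shape[-1]
    weights = softmax(tokens @ tokens.T / np.sqrt(d))
    return weights @ tokens, weights


def multi_scale_attention(feature_map, patch_sizes=(2, 4, 8)):
    """Attend over patches at several sizes, then average the results.

    Assumption: averaging is a stand-in for whatever aggregation
    the actual model uses.
    """
    h, w, c = feature_map.shape
    outputs = []
    for p in patch_sizes:
        attended, _ = self_attention(split_into_patches(feature_map, p))
        outputs.append(merge_patches(attended, p, h, w, c))
    return np.mean(outputs, axis=0)


def frequency_features(feature_map):
    """Per-channel log-magnitude spectrum (assumed frequency representation)."""
    spectrum = np.fft.fft2(feature_map, axes=(0, 1))
    return np.log1p(np.abs(spectrum))


def cross_modal_fusion(rgb, freq):
    """Fuse two (H, W, C) maps: RGB positions query frequency positions.

    Assumption: a residual connection adds the attended frequency
    features back onto the RGB features.
    """
    h, w, c = rgb.shape
    q = rgb.reshape(h * w, c)
    kv = freq.reshape(h * w, c)
    weights = softmax(q @ kv.T / np.sqrt(c))
    return (q + weights @ kv).reshape(h, w, c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((8, 8, 4))  # toy stand-in for a feature map
    fused = cross_modal_fusion(multi_scale_attention(features),
                               frequency_features(features))
    print(fused.shape)
```

Small patches let attention compare fine-grained neighbouring regions, while larger patches compare coarser regions, which is the intuition behind looking for inconsistencies at several spatial levels.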
