Towards Spatio-temporal Collaborative Learning: An End-to-End Deepfake Video Detection Framework

With the rapid development of facial tampering techniques, deepfake detection has attracted widespread public concern. Most existing video-based methods apply temporal convolution to learn temporal discontinuities directly, and may therefore overlook both local detail mutations and inconsistent global expression semantics along the temporal dimension. This makes it difficult to learn discriminative forgery cues. To mitigate this issue, we introduce a novel deepfake video detection framework designed to capture fine-grained traces of tampering. Concretely, we first present a Multi-layered Feature Extraction module (MFE) that constructs comprehensive spatio-temporal representations by stitching together features from different levels. We then propose a Bidirectional temporal Artifact Enhancement module (BAE), which exploits local differences between adjacent frames to enhance frame-level features. Finally, we present a Cross temporal Stride Aggregation strategy (CSA) to mine inconsistent global semantics and adaptively obtain multi-timescale representations. Extensive experiments on several benchmarks demonstrate that the proposed method outperforms competitive state-of-the-art approaches.
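The abstract gives no implementation details for BAE or CSA, so the following is only a minimal NumPy sketch of the two underlying ideas: enhancing frame-level features with differences to both temporal neighbors, and pooling over several temporal strides. The function names, the tanh gate, the stride set (1, 2, 4), and the uniform averaging are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def bidirectional_difference_enhance(feats):
    """Enhance per-frame features using differences to both neighbors.

    feats: array of shape (T, C), one feature vector per frame.
    Hypothetical illustration, not the paper's BAE module.
    """
    feats = np.asarray(feats, dtype=np.float64)
    fwd = np.zeros_like(feats)
    bwd = np.zeros_like(feats)
    fwd[:-1] = feats[1:] - feats[:-1]  # next frame minus current frame
    bwd[1:] = feats[:-1] - feats[1:]   # previous frame minus current frame
    # Assumed gate: larger local change gives a stronger boost;
    # a frame identical to its neighbors gets gate 0 and is unchanged.
    gate = np.tanh(np.abs(fwd) + np.abs(bwd))
    return feats + gate * feats


def cross_stride_aggregate(feats, strides=(1, 2, 4)):
    """Average clip-level descriptors computed at several temporal strides.

    Hypothetical illustration, not the paper's CSA strategy, which is
    described as adaptive; uniform averaging is used here for simplicity.
    """
    feats = np.asarray(feats, dtype=np.float64)
    pooled = [feats[::s].mean(axis=0) for s in strides]  # one (C,) vector per stride
    return np.mean(pooled, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_feats = rng.normal(size=(16, 8))  # 16 frames, 8 channels
    enhanced = bidirectional_difference_enhance(frame_feats)
    clip_descriptor = cross_stride_aggregate(enhanced)
    print(enhanced.shape, clip_descriptor.shape)
```

In this sketch a static clip passes through the enhancement step unchanged, while a frame that differs sharply from its neighbors is amplified, which mirrors the abstract's emphasis on local detail mutation. A learned, adaptive weighting over strides would replace the uniform mean in a real model.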
