Vision Transformer-Based Video Hashing Retrieval for Tracing the Source of Fake Videos

As fake videos do increasing harm to individuals and society, detecting the different types of forgery has become crucial. Existing forgery detection methods typically output only a probability value, which offers little interpretability or reliability. In this paper, we propose a source-tracing solution that finds the original real video behind a fake video, which can provide more reliable results in practical situations. Directly applying retrieval methods to the traceability task is infeasible, however: tracing requires finding the unique source video among a large number of real videos, whereas retrieval methods are typically used to find similar videos. In addition, training an effective hashing center that distinguishes similar real videos is challenging. To address these issues, we introduce a novel loss function, hash triplet loss, to capture fine-grained features with subtle differences. Extensive experiments show that our method outperforms state-of-the-art methods on multiple datasets covering object removal (video inpainting), object addition (video splicing), and object swapping (face swapping), demonstrating excellent robustness and cross-dataset performance. Experiments on similar video scenes validate the effectiveness of the hash triplet loss for nondifferentiable optimization problems.
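
The abstract does not define the hash triplet loss, so the following is only an illustrative sketch of a generic triplet loss on relaxed hash codes, not the authors' formulation. It assumes codes in [-1, 1] (for example, tanh outputs), uses the identity that the Hamming distance between two ±1 codes of length K equals (K - <u, v>) / 2 as a differentiable surrogate, and traces a fake video by nearest Hamming distance. All function names, the margin value, and the toy codes are hypothetical.

```python
def relaxed_distance(u, v):
    """Differentiable surrogate for Hamming distance on codes in [-1, 1].

    For exact +1/-1 codes of length K this equals the Hamming distance:
    (K - <u, v>) / 2.
    """
    k = len(u)
    return 0.5 * (k - sum(x * y for x, y in zip(u, v)))


def hash_triplet_loss(anchor, positive, negative, margin=2.0):
    """Generic triplet hinge on relaxed codes (assumed form, not the paper's).

    Pulls the fake video's code (anchor) toward its true source (positive)
    and pushes it away from a visually similar real video (negative).
    """
    gap = relaxed_distance(anchor, positive) - relaxed_distance(anchor, negative)
    return max(0.0, gap + margin)


def binarize(relaxed):
    """sign() quantization; this step is the nondifferentiable part."""
    return [1 if v >= 0 else -1 for v in relaxed]


def hamming_distance(a, b):
    """Count positions where two binary codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def trace_source(fake_code, real_codes):
    """Return the key of the real video whose binary code is closest."""
    query = binarize(fake_code)
    return min(
        real_codes,
        key=lambda name: hamming_distance(query, binarize(real_codes[name])),
    )


# Toy 4-bit example with hypothetical codes.
fake = [0.9, -0.8, 0.7, -0.6]
gallery = {
    "source": [0.8, -0.9, 0.6, -0.7],   # the true original video
    "similar": [0.9, -0.8, 0.7, 0.5],   # a look-alike real video
}
loss = hash_triplet_loss(fake, gallery["source"], gallery["similar"])
match = trace_source(fake, gallery)
```

In this toy setup the look-alike video differs from the source in a single bit, which is the fine-grained case the abstract describes: the margin term keeps the loss positive until the source is closer than the look-alike by at least `margin`.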
