DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion

We propose a new formulation of temporal action detection (TAD) based on denoising diffusion, DiffTAD for short. Taking random temporal proposals as input, it accurately yields action proposals for an untrimmed long video. This offers a generative modeling perspective, in contrast to previous discriminative learning approaches. The capability is achieved by first diffusing the ground-truth proposals to random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process in a Transformer decoder (e.g., DETR) by introducing a temporal location query design that converges faster in training. We further propose a cross-step selective conditioning algorithm for inference acceleration. Extensive evaluations on ActivityNet and THUMOS show that DiffTAD achieves top performance compared to prior art. The code will be made available at https://github.com/sauradip/DiffusionTAD.
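To make the forward/noising process concrete, the following is a minimal sketch of how ground-truth proposals could be diffused toward random ones. It assumes the standard Gaussian forward process of denoising diffusion models, a cosine noise schedule, and proposals encoded as normalized (center, width) pairs; these choices, along with the names `cosine_alpha_bar` and `diffuse_proposals`, are illustrative assumptions and are not taken from the DiffTAD implementation.

```python
import numpy as np


def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal rate alpha_bar_t for t = 0..num_steps (assumed cosine schedule)."""
    t = np.linspace(0, num_steps, num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # normalize so that alpha_bar_0 == 1


def diffuse_proposals(x0: np.ndarray, t: int, alpha_bar: np.ndarray,
                      rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bar[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise


rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(1000)

# Hypothetical ground-truth proposals as (center, width), normalized by video length.
gt = np.array([[0.25, 0.10],
               [0.70, 0.20]])

slightly_noisy = diffuse_proposals(gt, 10, alpha_bar, rng)   # still close to gt
almost_random = diffuse_proposals(gt, 990, alpha_bar, rng)   # nearly pure noise
```

Under these assumptions, a denoiser would be trained to recover `gt` from samples such as `almost_random` conditioned on video features, which is the reverse process the abstract describes.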
