OadTR: Online Action Detection with Transformers

Most recent approaches to online action detection apply Recurrent Neural Networks (RNNs) to capture long-range temporal structure. However, RNNs suffer from non-parallelism and vanishing gradients, which makes them hard to optimize. In this paper, we propose a new encoder-decoder framework based on Transformers, named OadTR, to tackle these problems. The encoder, with an attached task token, aims to capture the relationships and global interactions between historical observations. The decoder extracts auxiliary information by aggregating anticipated future clip representations. OadTR can therefore recognize current actions by encoding historical information and predicting future context simultaneously. We extensively evaluate the proposed OadTR on three challenging datasets: HDD, TVSeries, and THUMOS14. The experimental results show that OadTR achieves higher training and inference speeds than current RNN-based approaches, and significantly outperforms state-of-the-art methods in terms of both mAP and mcAP. Code is available at https://github.com/wangxiang1230/OadTR.
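
The encoder-decoder idea described above can be illustrated with a minimal sketch. It is not the released OadTR code: it assumes a single attention layer with identity projections, no positional encodings or feed-forward blocks, and it assumes that the task-token output is concatenated with the mean of the anticipated future representations. All function names (`attend`, `anticipate_future`, `current_frame_feature`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attended vector and the attention weights.
    Assumption: identity projections (no learned W_q, W_k, W_v).
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights


def anticipate_future(encoded, queries):
    """Decoder sketch: each prediction query attends over the encoder output."""
    return [attend(q, encoded, encoded)[0] for q in queries]


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def current_frame_feature(history, task_token, queries):
    """Combine encoded history (via the task token) with anticipated future.

    history:    list of per-frame feature vectors (past observations)
    task_token: extra vector appended to the history (assumed learnable)
    queries:    prediction query vectors, one per anticipated future step
    """
    tokens = history + [task_token]
    # One self-attention layer: every token attends over all tokens.
    encoded = [attend(t, tokens, tokens)[0] for t in tokens]
    task_out = encoded[-1]  # output at the task-token position
    future = anticipate_future(encoded, queries)
    # Assumption: concatenate task output with the pooled future context.
    return task_out + mean_pool(future)
```

With feature dimension `d`, `current_frame_feature` returns a vector of length `2 * d`, which a classifier (not shown) could map to action scores for the current frame.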
