Long Short-Term Relation Networks for Video Action Detection

It has been well recognized that modeling human-object or object-object relations would be helpful for detection task. Nevertheless, the problem is not trivial especially when exploring the interactions between human actor, object and scene (collectively as human-context) to boost video action detectors. The difficulty originates from the aspect that reliable relations in a video should depend on not only short-term human-context relation in the present clip but also the temporal dynamics distilled over a long-range span of the video. This motivates us to capture both short-term and long-term relations in a video. In this paper, we present a new Long Short-Term Relation Networks, dubbed as LSTR, that novelly aggregates and propagates relation to augment features for video action detection. Technically, Region Proposal Networks (RPN) is remoulded to first produce 3D bounding boxes, i.e., tubelets, in each video clip. LSTR then models short-term human-context interactions within each clip through spatio-temporal attention mechanism and reasons long-term temporal dynamics across video clips via Graph Convolutional Networks (GCN) in a cascaded manner. Extensive experiments are conducted on four benchmark datasets, and superior results are reported when comparing to state-of-the-art methods.

[1]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[2]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Andrew Zisserman,et al.  Detect to Track and Track to Detect , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4]  Tao Mei,et al.  Recurrent Tubelet Proposal and Recognition Networks for Action Detection , 2018, ECCV.

[5]  Tao Mei,et al.  Hierarchy Parsing for Image Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Cordelia Schmid,et al.  Learning to Track for Spatio-Temporal Action Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[8]  Ling-Yu Duan,et al.  Unified Spatio-Temporal Attention Networks for Action Recognition in Videos , 2019, IEEE Transactions on Multimedia.

[9]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[10]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[11]  B. S. Manjunath,et al.  Actor Conditioned Attention Maps for Video Action Detection , 2018, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[12]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Zhidong Deng,et al.  Fully Motion-Aware Network for Video Object Detection , 2018, ECCV.

[14]  Suman Saha,et al.  Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos , 2016, BMVC.

[15]  Song-Chun Zhu,et al.  Learning Human-Object Interactions by Graph Parsing Neural Networks , 2018, ECCV.

[16]  Frank Hutter,et al.  SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.

[17]  Tao Mei,et al.  Deep Quantization: Encoding Convolutional Activations with Deep Generative Model , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Tao Mei,et al.  Relation Distillation Networks for Video Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Kaiming He,et al.  Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Yang Wang,et al.  Discriminative figure-centric models for joint action localization and recognition , 2011, 2011 International Conference on Computer Vision.

[22]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[23]  Andrew Zisserman,et al.  Video Action Transformer Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jitendra Malik,et al.  Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Jiawei He,et al.  Generic Tubelet Proposals for Action Localization , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[26]  Abhinav Gupta,et al.  Videos as Space-Time Region Graphs , 2018, ECCV.

[27]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[28]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[29]  Yong Jae Lee,et al.  Video Object Detection with an Aligned Spatial-Temporal Memory , 2017, ECCV.

[30]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[31]  Asim Kadav,et al.  Attend and Interact: Higher-Order Object Interactions for Video Understanding , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Fei-Fei Li,et al.  Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[33]  Ramakant Nevatia,et al.  Spatio-Temporal Action Detection with Cascade Proposal and Location Anticipation , 2017, BMVC.

[34]  Kaiming He,et al.  Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Cordelia Schmid,et al.  Action Tubelet Detector for Spatio-Temporal Action Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[36]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[37]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  Shuicheng Yan,et al.  Seq-NMS for Video Object Detection , 2016, ArXiv.

[39]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[40]  Xiaogang Wang,et al.  Object Detection in Videos with Tubelet Proposal Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Xiaogang Wang,et al.  T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[42]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[43]  Yichen Wei,et al.  Relation Networks for Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44]  Cordelia Schmid,et al.  Multi-region Two-Stream R-CNN for Action Detection , 2016, ECCV.

[45]  Yujie Wang,et al.  Flow-Guided Feature Aggregation for Video Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[46]  Cordelia Schmid,et al.  Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[47]  Suman Saha,et al.  Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[48]  Tao Mei,et al.  Exploring Visual Relationship for Image Captioning , 2018, ECCV.

[49]  Rui Hou,et al.  Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[50]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[51]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[53]  Yichen Wei,et al.  Towards High Performance Video Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Gang Yu,et al.  Human Centric Spatio-Temporal Action Localization , 2018 .

[55]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[56]  Cordelia Schmid,et al.  Actor-Centric Relation Network , 2018, ECCV.

[57]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).