Single Shot Temporal Action Detection

Temporal action detection is an important yet challenging problem, since videos in real applications are usually long, untrimmed, and contain multiple action instances. The problem requires not only recognizing action categories but also detecting the start time and end time of each action instance. Many state-of-the-art methods adopt the "detection by classification" framework: they first generate temporal proposals and then classify them. The main drawback of this framework is that the boundaries of the action instance proposals are already fixed by the time of the classification step and cannot be refined. To address this issue, we propose a novel Single Shot Action Detector (SSAD) network based on 1D temporal convolutional layers, which skips the proposal generation step by directly detecting action instances in untrimmed video. Because no existing model can be directly adopted for temporal action detection, we empirically search for the best SSAD network architecture. Moreover, we investigate input feature types and fusion strategies to further improve detection accuracy. We conduct extensive experiments on two challenging datasets: THUMOS 2014 and MEXaction2. With the Intersection-over-Union threshold set to 0.5 during evaluation, SSAD significantly outperforms other state-of-the-art systems, increasing mAP from 19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.
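To make the evaluation criterion concrete, the following is a minimal sketch of temporal Intersection-over-Union between a predicted and a ground-truth action segment, assuming each segment is a (start, end) pair in seconds. The function names `temporal_iou` and `is_true_positive` are illustrative and do not come from the paper, and the matching rule shown is a simplification of a full mAP computation.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    # Length of the overlap; zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union is the sum of both lengths minus the shared part.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """Assumed matching rule: a detection counts if its IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    # A detection covering 2 s to 10 s against a ground truth of 0 s to 10 s.
    print(temporal_iou((2.0, 10.0), (0.0, 10.0)))
    print(is_true_positive((2.0, 10.0), (0.0, 10.0)))
```

Under this rule, a detection that overlaps an action but misplaces its boundaries can still fall below the 0.5 threshold, which is why boundary quality matters for the reported mAP.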
