Multi-Temporal-Resolution Technique for Action Recognition using C3D: Experimental Study

In any video containing an action, motion conveys information complementary to the individual frames, and the speed of that motion varies even between similar actions. Training a separate deep-learning model for each action speed is therefore a promising approach. In this paper, two novel ideas are explored: single-temporal-resolution single-model (STR-SM) and multi-temporal-resolution multi-model (MTR-MM). The STR-SM model is trained on one specific temporal resolution of the action dataset. This allows the model to cover a longer temporal range of frames with a single input and therefore to classify actions faster. The MTR-MM, on the other hand, is a set of STR-SM models, each trained on a different temporal resolution; their predictions are combined by late fusion with majority voting, which yields more accurate action recognition. Both models improve on the traditional training approach, by 3.63% and 6% in video-wise accuracy, respectively.
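The two ideas can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 16-frame clip length, the interpretation of a temporal resolution as a fixed frame stride, and the tie-breaking rule are all assumptions made for illustration. `sample_clip` builds a fixed-length clip at a given stride, `temporal_span` shows how many original frames that clip covers, and `majority_vote` fuses the labels predicted by several per-resolution models.

```python
from collections import Counter

import numpy as np


def sample_clip(video, stride, clip_len=16, start=0):
    """Return clip_len frames from video, keeping every stride-th frame.

    video is an array whose first axis indexes frames. clip_len=16 is an
    assumed default; the abstract does not state the clip length.
    """
    idx = start + stride * np.arange(clip_len)
    # Assumption: if the video is too short, repeat its last frame.
    idx = np.minimum(idx, len(video) - 1)
    return video[idx]


def temporal_span(stride, clip_len=16):
    """Number of original frames covered by one clip at this stride."""
    return stride * (clip_len - 1) + 1


def majority_vote(labels):
    """Late fusion: return the label predicted by the most models.

    Assumption: ties go to the earliest model in the list.
    """
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


if __name__ == "__main__":
    video = np.zeros((100, 112, 112, 3), dtype=np.uint8)  # dummy 100-frame video
    for stride in (1, 2, 4):
        clip = sample_clip(video, stride)
        print(stride, clip.shape, temporal_span(stride))
    # Hypothetical labels predicted by three per-resolution models.
    print(majority_vote([7, 3, 7]))
```

Under these assumptions, a coarser resolution lets a clip of the same size cover more of the video, so fewer clips are needed per video, while voting across resolutions combines models that each see the action at a different apparent speed.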
