Timeception for Complex Action Recognition

This paper focuses on the temporal aspect for recognizing human activities in videos; an important visual cue that has long been undervalued. We revisit the conventional definition of activity and restrict it to Complex Action: a set of one-actions with a weak temporal pattern that serves a specific purpose. Related works use spatiotemporal 3D convolutions with fixed kernel size, too rigid to capture the varieties in temporal extents of complex actions, and too short for long-range temporal modeling. In contrast, we use multi-scale temporal convolutions, and we reduce the complexity of 3D convolutions. The outcome is Timeception convolution layers, which reasons about minute-long temporal patterns, a factor of 8 longer than best related works. As a result, Timeception achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions and MultiTHUMOS. Further, we demonstrate that Timeception learns long-range temporal dependencies and tolerate temporal extents of complex actions.

[1]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[3]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[4]  Tinne Tuytelaars,et al.  Rank Pooling for Action Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Nicu Sebe,et al.  Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Haroon Idrees,et al.  The THUMOS challenge on action recognition for videos "in the wild" , 2016, Comput. Vis. Image Underst..

[8]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning For Video Understanding , 2017, ArXiv.

[9]  Andrew Zisserman,et al.  All About VLAD , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Hakan Bilen,et al.  Explorer Action Recognition with Dynamic Image Networks , 2017 .

[11]  Cees Snoek,et al.  Video2vec Embeddings Recognize Events When Examples Are Scarce , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[13]  Li Fei-Fei,et al.  Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos , 2015, International Journal of Computer Vision.

[14]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Dong Liu,et al.  EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video , 2015, ACM Multimedia.

[17]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[18]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  François Chollet,et al.  Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Deva Ramanan,et al.  Attentional Pooling for Action Recognition , 2017, NIPS.

[22]  Andrew Zisserman,et al.  What have We Learned from Deep Representations for Action Recognition? , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[24]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Nicu Sebe,et al.  Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos , 2017, MMM.

[28]  Ivan Laptev,et al.  Learnable pooling with Context Gating for video classification , 2017, ArXiv.

[29]  Limin Wang,et al.  Temporal Segment Networks for Action Recognition in Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Thomas Serre,et al.  The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[35]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[36]  Ali Farhadi,et al.  Asynchronous Temporal Fields for Action Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[38]  Zhe Wang,et al.  Towards Good Practices for Very Deep Two-Stream ConvNets , 2015, ArXiv.

[39]  Cees Snoek,et al.  Video Time: Properties, Encoders and Evaluation , 2018, BMVC.

[40]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Arnold W. M. Smeulders,et al.  Unified Embedding and Metric Learning for Zero-Exemplar Event Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Andrea Vedaldi,et al.  Dynamic Image Networks for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[44]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[45]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[46]  Abhinav Gupta,et al.  Videos as Space-Time Region Graphs , 2018, ECCV.

[47]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[48]  Xiangyu Zhang,et al.  ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Abhinav Gupta,et al.  ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.