Motion Feature Network: Fixed Motion Filter for Action Recognition

Spatio-temporal representations in frame sequences play an important role in the task of action recognition. Previously, a method of using optical flow as a temporal information in combination with a set of RGB images that contain spatial information has shown great performance enhancement in the action recognition tasks. However, it has an expensive computational cost and requires two-stream (RGB and optical flow) framework. In this paper, we propose MFNet (Motion Feature Network) containing motion blocks which make it possible to encode spatio-temporal information between adjacent frames in a unified network that can be trained end-to-end. The motion block can be attached to any existing CNN-based action recognition frameworks with only a small additional cost. We evaluated our network on two of the action recognition datasets (Jester and Something-Something) and achieved competitive performances for both datasets by training the networks from scratch.

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[4]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Kurt Keutzer,et al.  Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8]  Jonghyun Choi,et al.  ActionFlowNet: Learning Motion Representation for Action Recognition , 2016, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[9]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[10]  Wei Zhang,et al.  Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Susanne Westphal,et al.  The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Yutaka Satoh,et al.  Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[13]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[16]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[17]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[18]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[19]  Limin Wang,et al.  Appearance-and-Relation Networks for Video Classification , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[21]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Bowen Zhang,et al.  Real-Time Action Recognition with Enhanced Motion Vector CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[25]  Shih-Fu Chang,et al.  ConvNet Architecture Search for Spatiotemporal Feature Learning , 2017, ArXiv.

[26]  Ivan Laptev,et al.  Learnable pooling with Context Gating for video classification , 2017, ArXiv.

[27]  Yoshua Bengio,et al.  Convolutional networks for images, speech, and time series , 1998 .

[28]  Yi Zhu,et al.  Hidden Two-Stream Convolutional Networks for Action Recognition , 2017, ACCV.

[29]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[30]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Jun Wang,et al.  Fusing Multi-Stream Deep Networks for Video Classification , 2015, ArXiv.

[32]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[34]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[35]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[36]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.