Video-level Multi-model Fusion for Action Recognition

Approaches based on spatio-temporal features, such as two-stream methods and 3D convolution methods, have emerged for video action recognition. However, current methods suffer from partial observation or are restricted to modeling a single type of information. Segment-level recognition results obtained from dense sampling cannot represent the entire video and therefore lead to partial observation. Moreover, it is hard for a single model to capture the complementary spatial, temporal, and spatio-temporal information in a video at the same time. The challenge is therefore to build a video-level representation and to capture multiple types of information. In this paper, a video-level multi-model fusion method for action recognition is proposed to solve these problems. First, an efficient video-level 3D convolution model, which assembles segment-level 3D convolution models, is proposed to capture the global information in the video. Second, a multi-model fusion architecture is proposed for video action recognition to capture multiple types of information: the spatial, temporal, and spatio-temporal information are aggregated with an SVM classifier. Experimental results show that this method achieves state-of-the-art performance on UCF-101 (97.6%) without pre-training on Kinetics.
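
The two ideas in the abstract, a video-level prediction assembled from segment-level outputs and a fusion of spatial, temporal, and spatio-temporal streams, can be illustrated with a minimal sketch. The abstract does not specify how segments are combined or how the SVM input is built, so the following rests on assumptions: segment scores are averaged, and the three video-level vectors are concatenated into one feature that an SVM could then be trained on. The function names `video_level_scores` and `fuse_streams` and all numbers are hypothetical.

```python
from typing import List, Sequence


def video_level_scores(segment_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-segment class scores into one video-level score vector.

    Averaging is an assumed consensus function, not necessarily the paper's.
    """
    if not segment_scores:
        raise ValueError("need at least one segment")
    n_segments = len(segment_scores)
    n_classes = len(segment_scores[0])
    return [
        sum(segment[c] for segment in segment_scores) / n_segments
        for c in range(n_classes)
    ]


def fuse_streams(
    spatial: Sequence[float],
    temporal: Sequence[float],
    spatio_temporal: Sequence[float],
) -> List[float]:
    """Concatenate the three video-level vectors into one fusion feature.

    Concatenation is an assumption; the result is what a classifier such as
    an SVM would take as input for one video.
    """
    return list(spatial) + list(temporal) + list(spatio_temporal)


# Toy scores for one video with two classes (made-up values).
spatial_segments = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]   # e.g. RGB stream
temporal_segments = [[1.0, 0.0], [0.5, 0.5]]              # e.g. optical-flow stream
spatio_temporal_segments = [[0.0, 1.0]]                   # e.g. 3D-convolution stream

fused = fuse_streams(
    video_level_scores(spatial_segments),
    video_level_scores(temporal_segments),
    video_level_scores(spatio_temporal_segments),
)
print(fused)
```

In this sketch each video yields one fused vector regardless of how many segments were sampled, which is what allows a single classifier to operate at the video level rather than on individual segments.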
