Global motion estimation with iterative optimization-based independent univariate model for action recognition

Abstract The motion information used in existing video action recognition schemes is a mixture of global motion (GM) and local motion (LM). In fact, GM and LM carry distinct semantic concepts, so it is promising to decouple them from the mixed motion. Numerous efforts have been made to design global motion models for video encoding, video dejittering, video denoising, and so on. Nevertheless, some of these models are too basic to cover the camera motions found in action recognition, while others are overcomplicated. In this paper, we focus on the characteristics of action recognition and propose a novel independent univariate GM model. It ignores camera rotation, which rarely appears in action recognition videos, and represents the GM in the x and y directions separately. Furthermore, GM is position invariant because it stems from the camera motion common to the entire frame. Pixels with purely global motion follow the same parametric model, whereas pixels with mixed motion can be treated as outliers. Motivated by this, we develop an iterative optimization scheme for GM estimation that removes outlier points step by step and estimates the global motion in a coarse-to-fine manner. Finally, the LM is estimated with a spatio-temporal threshold-based method. Experimental results demonstrate that the proposed GM model achieves a better trade-off between model complexity and robustness, and that the iterative optimization scheme is more effective than existing algorithms. Comparative experiments using four popular action recognition models on UCF-101 (for action recognition) and NCAA (for group activity recognition) demonstrate that local motions are more effective than mixed motions.
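To make the idea of iterative outlier removal concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the model's functional form, so the sketch assumes each axis follows a univariate linear model (horizontal flow u depends only on x, vertical flow v only on y). The decreasing threshold schedule, the least-squares fit, and the names fit_axis and estimate_global_motion are likewise hypothetical choices made for illustration.

```python
import numpy as np


def fit_axis(coord, flow):
    """Least-squares fit of flow ~ scale * coord + shift along one axis.

    ASSUMPTION: a linear per-axis model; the abstract does not state the form.
    """
    coord = np.asarray(coord, dtype=float)
    design = np.stack([coord, np.ones_like(coord)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, flow, rcond=None)
    return scale, shift


def estimate_global_motion(x, y, u, v, thresholds=(4.0, 2.0, 1.0, 0.5)):
    """Fit independent per-axis models while rejecting outliers step by step.

    ASSUMPTION: `thresholds` is a hypothetical coarse-to-fine schedule of
    residual limits in pixels; it is not taken from the paper.
    """
    x, y, u, v = (np.asarray(a, dtype=float).ravel() for a in (x, y, u, v))
    inliers = np.ones(x.size, dtype=bool)
    for tau in thresholds:
        px = fit_axis(x[inliers], u[inliers])
        py = fit_axis(y[inliers], v[inliers])
        # Residual of every pixel against the current global model.
        residual = np.hypot(u - (px[0] * x + px[1]), v - (py[0] * y + py[1]))
        keep = residual <= tau
        if keep.sum() < 2:  # too few points left to fit a line
            break
        inliers = keep
    # Final refit so the returned parameters match the returned inlier mask.
    px = fit_axis(x[inliers], u[inliers])
    py = fit_axis(y[inliers], v[inliers])
    return px, py, inliers


# Synthetic flow field: a zoom-plus-pan camera motion over a 20x20 grid,
# with a small block of pixels that also carries its own (local) motion.
ys, xs = np.mgrid[0:20, 0:20]
x, y = xs.ravel().astype(float), ys.ravel().astype(float)
u = 0.05 * x + 1.0
v = -0.03 * y + 0.5
moving = (x >= 5) & (x <= 9) & (y >= 5) & (y <= 9)
u[moving] += 8.0
v[moving] -= 6.0

px, py, inliers = estimate_global_motion(x, y, u, v)
```

Subtracting the fitted global field from (u, v) leaves a residual flow. Thresholding that residual would be one simple way to approximate the LM, although the spatio-temporal thresholding used in the paper is not specified in the abstract and is not reproduced here.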
