GLM-Net: Global and Local Motion Estimation via Task-Oriented Encoder-Decoder Structure

In this work, we study the problem of separating global camera motion and local dynamic motion in an optical flow field. Previous methods either estimate the global motion with a parametric model, such as a homography, or estimate both motions jointly as a single optical flow field. However, none of these methods can directly estimate global and local motions in an end-to-end manner. In addition, accurately separating the two motions from a hybrid flow field is challenging, because one motion can easily confound the estimate of the other when they are compounded. To this end, we propose GLM-Net, an end-to-end global and local motion estimation network. We design two encoder-decoder structures, each oriented toward a different task, to separate the motions in the optical flow. One structure adopts a mask autoencoder to extract the global motion, while the other uses an attention U-Net to refine the local motion. We further design two effective training methods to overcome the lack of supervision. We apply our method to the action recognition datasets NCAA and UCF-101 to verify the accuracy of the local motion, and to the homography estimation dataset DHE to verify the accuracy of the global motion. Experimental results show that our method achieves competitive performance on both tasks simultaneously, validating the effectiveness of the motion separation.
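To make the notion of a hybrid flow field concrete, the following minimal sketch shows the classical parametric decomposition that the abstract contrasts with: assuming a homography describing the camera motion is already known, the global flow is the displacement it induces at every pixel, and the local flow is the residual. This is not GLM-Net's learned method; the function names `homography_flow` and `split_flow` are hypothetical and chosen only for illustration.

```python
import numpy as np

def homography_flow(H, height, width):
    """Per-pixel displacement (dx, dy) induced by a 3x3 homography H."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (height, width, 3)
    warped = pts @ H.T                                   # apply H to each pixel
    warped_xy = warped[..., :2] / warped[..., 2:3]       # dehomogenize
    return warped_xy - pts[..., :2]

def split_flow(flow, H):
    """Split a hybrid flow of shape (height, width, 2) into global and local parts.

    Assumes H is given; in practice it would have to be estimated,
    which is exactly where local motion can corrupt the result.
    """
    height, width = flow.shape[:2]
    global_flow = homography_flow(H, height, width)
    local_flow = flow - global_flow
    return global_flow, local_flow

# Toy example: camera translation of (2, -1) plus one independently moving pixel.
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0],
              [0.0, 0.0, 1.0]])
flow = np.zeros((4, 5, 2)) + np.array([2.0, -1.0])
flow[1, 2] += np.array([3.0, 3.0])

global_flow, local_flow = split_flow(flow, H)
```

In this toy case the residual is nonzero only at the pixel that moves independently of the camera. With real data the homography is unknown and must be fitted to a flow that already contains local motion, which is the mutual interference the abstract describes.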
