ActionSnapping: Motion-Based Video Synchronization

Video synchronization is a fundamental step for many applications in computer vision, ranging from video morphing to motion analysis. We present a novel method for synchronizing action videos in which a similar action is performed by different people, at different times and locations, and with different local speed changes, as in weightlifting, baseball pitching, or dance. Our approach extends the popular “snapping” tool of video editing software and allows users to automatically snap action videos together in a timeline based on their content. Because the action can take place at different locations, existing appearance-based methods are not appropriate. Our approach instead leverages motion information and computes a nonlinear synchronization of the input videos to establish frame-to-frame temporal correspondences. We demonstrate that our approach can be applied to video synchronization, video annotation, and action snapshots. Our approach has been successfully evaluated with ground truth data and in a user study.
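To make the idea of a nonlinear, frame-to-frame synchronization concrete, the following is a minimal sketch that uses classic dynamic time warping as a stand-in technique. It is not the method of this paper, whose details the abstract does not give. It assumes each frame has already been reduced to a small motion descriptor (a plain list of numbers here), and the names `frame_cost` and `align` are hypothetical.

```python
from math import inf


def frame_cost(a, b):
    """Squared Euclidean distance between two per-frame motion descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def align(seq_a, seq_b):
    """Return a monotonic list of (i, j) frame pairs aligning seq_a with seq_b.

    Assumption: plain dynamic time warping over precomputed descriptors,
    used only to illustrate a nonlinear temporal mapping.
    """
    n, m = len(seq_a), len(seq_b)
    # acc[i][j] is the minimal accumulated cost of aligning
    # the first i frames of seq_a with the first j frames of seq_b.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_cost(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = cost + min(
                acc[i - 1][j - 1],  # advance both videos
                acc[i - 1][j],      # advance only seq_a (seq_b frame repeats)
                acc[i][j - 1],      # advance only seq_b (seq_a frame repeats)
            )

    # Walk back from the last frame pair to recover the correspondences.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


# Toy descriptors: the same "action" performed slowly and quickly.
slow = [[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]]
fast = [[0.0], [1.0], [2.0], [3.0]]
pairs = align(slow, fast)
print(pairs)
```

In this toy input, the slow sequence dwells on some poses for two frames, so several of its frames map to the same frame of the fast sequence. That many-to-one mapping is what distinguishes a nonlinear synchronization from a constant time offset or a uniform speed change.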
