Analysis and synthesis of interactive video sprites

In this thesis, we explore how video, an extremely compelling medium that is traditionally consumed passively, can be transformed into interactive experiences, and what is preventing content creators from using it for this purpose. Film captures extremely rich and dynamic information but, due to the sheer amount of data and the drastic change in content appearance over time, it is difficult to work with. Content creators are willing to invest time and effort to design and capture video, so why not also to manipulate and interact with it? We hypothesize that people can help and be helped by automatic video processing and synthesis algorithms when they are given the right tools. Computer games are a very popular interactive medium in which players engage with dynamic content in compelling and intuitive ways. The first contribution of this thesis is an in-depth exploration of the modes of interaction that enable game-like video experiences. Through active discussions with game developers, we identify both how to assist content creators and how players can dynamically interact with their creations. We present concepts, explore algorithms and design tools that together enable interactive video experiences. Our findings on processing videos and interacting with filmed content come together in the second major contribution of this thesis: a new medium of expression in which video elements can be looped, merged and triggered interactively. Static-camera videos are converted into loopable sequences that can be controlled in real time in response to simple end-user requests. We present novel algorithms and interactive tools that enable this new medium. Our human-in-the-loop system gives users progressively more creative control over the video content as they invest more effort, and we evaluate it with the help of artists.
Monocular, static-camera videos are a good fit for looping algorithms, but they have been limited to two-dimensional applications because pixels are only reshuffled in space and time on the image plane. The final contribution of this thesis breaks through this barrier by allowing users to interact with filmed objects in three dimensions. Our novel object tracking algorithm extends existing 2D bounding box trackers with 3D information, such as a well-fitting bounding volume, which in turn enables a new breed of interactive video experiences. The filmed content becomes a three-dimensional playground in which users are free to move the virtual camera or the tracked objects and see them from novel viewpoints.
