Video Primal Sketch: A Unified Middle-Level Representation for Video

This paper presents a middle-level video representation named video primal sketch (VPS), which integrates two regimes of models: (i) a sparse coding model that uses static or moving primitives to explicitly represent moving corners, lines, feature points, etc.; and (ii) a FRAME/MRF model that reproduces feature statistics extracted from the input video to implicitly represent textured motion, such as water and fire. The feature statistics include histograms of spatio-temporal filter responses and velocity distributions. This paper makes three contributions to the literature: (i) learning a dictionary of video primitives using parametric generative models; (ii) proposing the spatio-temporal FRAME and motion-appearance FRAME models for modeling and synthesizing textured motion; and (iii) developing a parsimonious hybrid model for generic video representation. Given an input video, VPS automatically selects the proper model for each motion pattern and is compatible with high-level action representations. In the experiments, we synthesize a number of textured motions; reconstruct real videos using the VPS; report a series of human perception experiments that verify the quality of the reconstructed videos; demonstrate how the VPS changes over scale transitions in videos; and present the close connection between VPS and high-level action models.
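
The two regimes described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the temporal-difference filter, the identity-style dictionary layout (unit-norm atoms as columns), and the residual-energy threshold used to choose between regimes are all assumptions made for illustration. Explicit coding is done here with plain matching pursuit, and the implicit description is a single normalized histogram of filter responses.

```python
import numpy as np


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy sparse coding of a 1-D signal.

    `dictionary` is assumed to hold unit-norm atoms as columns.
    Returns the coefficient vector and the final residual.
    """
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        responses = dictionary.T @ residual        # correlation with every atom
        k = int(np.argmax(np.abs(responses)))      # best-matching atom
        coeffs[k] += responses[k]
        residual -= responses[k] * dictionary[:, k]
    return coeffs, residual


def filter_histogram(patch, bins=8):
    """Implicit description: normalized histogram of a temporal-difference
    filter response. `patch` has shape (T, H, W) with values in [0, 1]."""
    response = np.diff(patch, axis=0).ravel()
    hist, _ = np.histogram(response, bins=bins, range=(-1.0, 1.0))
    return hist / max(hist.sum(), 1)


def describe_patch(patch, dictionary, n_atoms=3, max_residual_ratio=0.1):
    """Hypothetical selection rule: keep the sparse code when a few atoms
    explain the patch well, otherwise fall back to filter statistics."""
    signal = patch.ravel()
    coeffs, residual = matching_pursuit(signal, dictionary, n_atoms)
    energy = float(signal @ signal)
    ratio = float(residual @ residual) / energy if energy > 0 else 0.0
    if ratio <= max_residual_ratio:
        return "explicit", coeffs
    return "implicit", filter_histogram(patch)
```

Calling `describe_patch(patch, dictionary)` on a small spatio-temporal patch returns a label and either the sparse coefficients or the histogram. The residual threshold is only a stand-in for whatever model-selection criterion is actually used; the abstract states that selection is automatic but does not say how it is done.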
