Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns

Dynamic patterns are characterized by complex spatial and motion patterns. Understanding them requires a disentangled representational model that separates the factorial components. A commonly used model for dynamic patterns is the state space model, in which the state evolves over time according to a transition model and generates the observed image frames according to an emission model. To model motion explicitly, it is natural to base the model on the motions, or displacement fields, of the pixels. Thus, in the emission model, we let the hidden state generate the displacement field, which warps the trackable component in the previous image frame to generate the next frame, while adding a simultaneously emitted residual image to account for the change that cannot be explained by the deformation. The warping of the previous image accounts for the trackable part of the change between image frames, while the residual image accounts for the intrackable part. We use a maximum likelihood algorithm to learn the model, which iterates between inferring the latent noise vectors that drive the transition model and updating the parameters given the inferred latent vectors. We also adopt a regularization term that penalizes the norms of the residual images, encouraging the model to explain the change between image frames by trackable motion. Unlike existing methods for dynamic patterns, we learn our model in an unsupervised setting, without ground-truth displacement fields. In addition, our model defines a notion of intrackability through the separation of the warped component and the residual component in each image frame. We show that our method can synthesize realistic dynamic patterns and disentangle appearance, trackable motion, and intrackable motion. The learned models are useful for motion transfer, and they provide a natural way to define and measure the intrackability of a dynamic pattern.
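
The emission step described above (warp the previous frame by a displacement field, then add a residual image) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the backward-warping convention, bilinear interpolation, the squared L2 penalty, and the weight `lam` are all assumptions made for illustration. In the model the displacement field and residual would be generated from the hidden state; here they are plain input arrays, and the sketch warps the whole previous frame without modeling how its trackable component is isolated.

```python
import numpy as np


def warp_bilinear(image, flow):
    """Backward-warp a grayscale image (H, W) by a displacement field (H, W, 2).

    Assumed convention: flow[y, x] = (dx, dy), and output pixel (x, y) is
    sampled from the input at (x + dx, y + dy) with bilinear interpolation.
    Sample positions are clamped to the image border.
    """
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each sample position.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets used as interpolation weights.
    wx = src_x - x0
    wy = src_y - y0

    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom


def emit_next_frame(prev_frame, flow, residual):
    """Next frame = warped previous frame (trackable) + residual (intrackable)."""
    return warp_bilinear(prev_frame, flow) + residual


def penalized_loss(target, prev_frame, flow, residual, lam=0.1):
    """Squared reconstruction error plus a penalty on the residual norm.

    `lam` is a hypothetical weight; a larger value pushes the explanation
    of frame-to-frame change toward the warping term.
    """
    recon = emit_next_frame(prev_frame, flow, residual)
    return np.sum((target - recon) ** 2) + lam * np.sum(residual ** 2)


if __name__ == "__main__":
    prev = np.arange(12, dtype=float).reshape(3, 4)
    flow = np.zeros((3, 4, 2))
    flow[..., 0] = 1.0  # sample every pixel from one column to the right
    residual = np.zeros_like(prev)
    print(emit_next_frame(prev, flow, residual))
```

With a zero residual, the output frame is explained entirely by the displacement field. Any change that the warp cannot reproduce has to be absorbed by the residual, and the penalty makes that absorption costly.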
