Linearizing Visual Processes with Convolutional Variational Autoencoders

This work studies the problem of modeling non-linear visual processes by learning linear generative models from observed sequences. We propose a joint learning framework that combines a Linear Dynamic System with a Variational Autoencoder built from convolutional layers. After discussing several conditions for linearizing neural networks, we propose an architecture that allows Variational Autoencoders to simultaneously learn the non-linear observation model and the linear state transition from a sequence of observed frames. The proposed framework is demonstrated experimentally in three series of synthesis experiments.
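The model structure described above, a linear state transition in a latent space paired with a non-linear observation map, can be illustrated with a minimal NumPy sketch. This is not the joint training procedure of the paper: the decoder here is a fixed random `tanh` map standing in for a convolutional decoder, the latent states are assumed to be known, and the transition matrix is fitted by ordinary least squares. The names `simulate`, `fit_transition`, `A_true`, and `W` are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate(A, decode, z0, steps, noise_std=0.0, seed=0):
    """Roll out z_{t+1} = A z_t + noise and observe y_t = decode(z_t)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    states, frames = [], []
    for _ in range(steps):
        states.append(z)
        frames.append(decode(z))
        z = A @ z + noise_std * rng.standard_normal(z.shape)
    return np.stack(states), np.stack(frames)

def fit_transition(states):
    """Least-squares estimate of A from consecutive latent states.

    With states stored as rows, z_{t+1} = A z_t becomes Z_next = Z_prev A^T,
    so we solve for A^T and transpose the result.
    """
    Z_prev, Z_next = states[:-1], states[1:]
    A_T, *_ = np.linalg.lstsq(Z_prev, Z_next, rcond=None)
    return A_T.T

# Assumed toy setup: a slowly decaying rotation in a 2-D latent space.
theta = 0.1
A_true = 0.99 * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])

# Stand-in for a non-linear decoder: random linear map followed by tanh.
W = np.random.default_rng(1).standard_normal((16, 2))
decode = lambda z: np.tanh(W @ z)

states, frames = simulate(A_true, decode, z0=[1.0, 0.0], steps=50)
A_hat = fit_transition(states)
```

In this noise-free toy setting the least-squares fit recovers the transition matrix from the latent trajectory. The setting of the abstract is harder, because only the frames are observed and the encoder, decoder, and transition have to be learned together.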
