Stochastic Video Prediction with Conditional Density Estimation

Frame-to-frame stochasticity is a major challenge in video prediction. Standard feedforward and recurrent networks tend to average over possible future states, which can in part be attributed to their limited ability to model stochasticity. We propose the use of conditional variational autoencoders (CVAEs) for video prediction. To model multi-modal densities in frame-to-frame transitions, we extend the CVAE framework by modeling the latent variable with a mixture of Gaussians in the generative network. We tested the proposed Gaussian mixture CVAE (GM-CVAE) on a simple video-prediction task involving a stochastically moving object. The architecture showed improved performance, with noticeably lower rates of blurring/averaging than a feedforward network and a Gaussian CVAE. We also describe how the CVAE framework can be applied to improve existing deterministic video prediction models.
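The averaging effect described above can be illustrated with a small toy sketch. It is not the experimental setup of this work: the 1-D "frame", the `render` helper, and the two equally likely futures are all hypothetical choices made for illustration. The sketch shows only that a deterministic prediction minimizing mean squared error equals the expected next frame, which spreads intensity over both futures, whereas picking one mode first and then rendering it yields a sharp frame.

```python
import random

WIDTH = 7  # hypothetical frame width


def render(pos, width=WIDTH):
    """Return a 1-D 'frame' with a single bright pixel at index pos."""
    frame = [0.0] * width
    frame[pos] = 1.0
    return frame


# Assumed toy transition: an object at index 3 moves one step left or
# one step right with equal probability.
futures = [render(2), render(4)]
probs = [0.5, 0.5]

# The MSE-optimal deterministic prediction is the expected next frame:
# a probability-weighted average of the possible futures (the "blur").
mse_prediction = [
    sum(p * f[i] for p, f in zip(probs, futures)) for i in range(WIDTH)
]

# Sampling a mode first, then rendering it, keeps the object sharp.
rng = random.Random(0)
sampled = rng.choices(futures, weights=probs, k=1)[0]

print("MSE-optimal prediction:", mse_prediction)
print("Sampled prediction:    ", sampled)
```

Here the averaged prediction contains two half-intensity pixels, while the sampled frame always contains a single full-intensity pixel. A latent-variable model with a multi-modal latent density is one way to obtain the second behavior.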