Single-Frame Regularization for Temporally Stable CNNs

Convolutional neural networks (CNNs) can model complicated non-linear relations between images. However, they are notoriously sensitive to small changes in the input. Most CNNs trained to describe image-to-image mappings generate temporally unstable results when applied to video sequences, leading to flickering artifacts and other inconsistencies over time. To use CNNs on video material, previous methods have relied on estimating dense frame-to-frame motion information (optical flow) during training and/or inference, or on recurrent learning structures. We take a different approach, posing temporal stability as a regularization of the cost function. The regularization is formulated to account for the different types of motion that can occur between frames, so that temporally stable CNNs can be trained without video material or expensive motion estimation. The training can be performed as a fine-tuning operation, without architectural modifications of the CNN. Our evaluation shows that this training strategy leads to large improvements in temporal smoothness. Moreover, for small datasets the regularization can boost generalization performance to a much larger extent than is possible with naive augmentation strategies.
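
The abstract describes the idea only at a high level; the sketch below illustrates how a single-frame temporal regularizer of this kind might look in PyTorch. The transform (a small random translation standing in for inter-frame motion), the names `sample_motion`, `warp`, `regularized_loss`, and the weight `lambda_reg` are illustrative assumptions, not the paper's exact formulation, and the consistency term shown (output of a "moved" input should match the equally moved output of the original input) is just one plausible instance of a motion-aware regularizer.

```python
import torch
import torch.nn.functional as F

def sample_motion(batch_size, max_shift_px, width, height, device):
    """Sample a small random translation per image as a 2x3 affine matrix.
    A crude stand-in for inter-frame motion; more general motion types
    (rotation, zoom, local warps) could be sampled in the same way."""
    theta = torch.zeros(batch_size, 2, 3, device=device)
    theta[:, 0, 0] = 1.0
    theta[:, 1, 1] = 1.0
    # affine_grid uses normalized coordinates in [-1, 1], so a translation
    # of 2 * px / width corresponds to roughly px pixels horizontally.
    theta[:, 0, 2] = (torch.rand(batch_size, device=device) * 2 - 1) * 2.0 * max_shift_px / width
    theta[:, 1, 2] = (torch.rand(batch_size, device=device) * 2 - 1) * 2.0 * max_shift_px / height
    return theta

def warp(img, theta):
    """Apply the sampled motion to a batch of images (N, C, H, W)."""
    grid = F.affine_grid(theta, img.shape, align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

def regularized_loss(model, x, target, task_loss_fn,
                     lambda_reg=0.1, max_shift_px=4.0):
    """Task loss plus a transform-consistency regularizer computed from a
    single frame. Assumes an image-to-image model whose output has the
    same spatial size as the input."""
    n, _, h, w = x.shape
    y = model(x)
    loss_task = task_loss_fn(y, target)

    theta = sample_motion(n, max_shift_px, w, h, x.device)
    y_of_moved_x = model(warp(x, theta))   # output on the simulated "next frame"
    moved_y = warp(y, theta)               # same motion applied to the original output
    loss_reg = F.mse_loss(y_of_moved_x, moved_y)

    return loss_task + lambda_reg * loss_reg
```

Since no video data or motion estimation is involved, such a loss could be dropped into an otherwise standard single-image training loop, for example as a fine-tuning stage for an already-trained model, consistent with the fine-tuning use described in the abstract.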
