Robust Semantic Video Segmentation through Confidence-based Feature Map Warping

One of the limiting factors when using deep learning methods in the field of highly automated driving is their lack of robustness. Objects that suddenly appear or disappear from one image to another due to inaccurate predictions as well as occurring perturbations in the input data can have devastating consequences. A possibility to increase model robustness is the use of temporal consistency in video data. Our approach aims for a confidence-based combination of feature maps that are warped from previous time stages into the current one. This enables us to stabilize the network prediction and increase its robustness against perturbations. In order to demonstrate the effectiveness of our approach, we have created a test data set with image perturbations such as image artifacts and adversarial examples in which we significantly outperform the baseline.

[1]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Truong Q. Nguyen,et al.  Semantic video segmentation: Exploring inference efficiency , 2015, 2015 International SoC Design Conference (ISOCC).

[3]  Xuming He,et al.  Multi-class Semantic Video Segmentation with Exemplar-Based Object Reasoning , 2015, 2015 IEEE Winter Conference on Applications of Computer Vision.

[4]  Noah Snavely,et al.  Material recognition in the wild with the Materials in Context Database , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Eugenio Culurciello,et al.  ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation , 2016, ArXiv.

[6]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Kilian Q. Weinberger,et al.  On Calibration of Modern Neural Networks , 2017, ICML.

[8]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[9]  Kevin Gimpel,et al.  A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks , 2016, ICLR.

[10]  Garrison W. Cottrell,et al.  Understanding Convolution for Semantic Segmentation , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[11]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  Trevor Darrell,et al.  Clockwork Convnets for Video Semantic Segmentation , 2016, ECCV Workshops.

[13]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[14]  Lorenzo Torresani,et al.  Deep End2End Voxel2Voxel Prediction , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[15]  Vladlen Koltun,et al.  Feature Space Optimization for Semantic Video Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Alan L. Yuille,et al.  Adversarial Examples for Semantic Segmentation and Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Peter V. Gehler,et al.  Semantic Video CNNs Through Representation Warping , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[19]  Cristian Sminchisescu,et al.  Semantic Video Segmentation by Gated Recurrent Flow Propagation , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Mahmood Fathy,et al.  STFCN: Spatio-Temporal FCN for Semantic Video Segmentation , 2016, ArXiv.

[21]  Jürgen Schmidhuber,et al.  Learning to forget: continual prediction with LSTM , 1999 .

[22]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.