Perceptual Consistency in Video Segmentation

In this paper, we present a novel perceptual consistency perspective on video semantic segmentation, which can capture both temporal consistency and pixel-wise correctness. Given two nearby video frames, perceptual consistency measures how much the segmentation decisions agree with the pixel correspondences obtained via matching general perceptual features. More specifically, for each pixel in one frame, we find the most perceptually correlated pixel in the other frame. Our intuition is that such a pair of pixels are highly likely to belong to the same class. Next, we assess how much the segmentation agrees with such perceptual correspondences, based on which we derive the perceptual consistency of the segmentation maps across these two frames. Utilizing perceptual consistency, we can evaluate the temporal consistency of video segmentation by measuring the perceptual consistency over consecutive pairs of segmentation maps in a video. Furthermore, given a sparsely labeled test video, perceptual consistency can be utilized to aid with predicting the pixel-wise correctness of the segmentation on an unlabeled frame. More specifically, by measuring the perceptual consistency between the predicted segmentation and the available ground truth on a nearby frame and combining it with the segmentation confidence, we can accurately assess the classification correctness on each pixel. Our experiments show that the proposed perceptual consistency can more accurately evaluate the temporal consistency of video segmentation as compared to flow-based measures. Furthermore, it can help more confidently predict segmentation accuracy on unlabeled test frames, as compared to using classification confidence alone. Finally, our proposed measure can be used as a regularizer during the training of segmentation models, which leads to more temporally *Work done at Qualcomm AI Research. †Nanjing University of Science and Technology, Nanjing, China ‡Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc. §Equal contribution. ¶Qualcomm Technologies, Inc. consistent video segmentation while maintaining accuracy.

[1]  Zhe L. Lin,et al.  Temporally Distributed Networks for Fast Video Semantic Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Guosheng Lin,et al.  Video Object Segmentation and Tracking: A Survey , 2019, ArXiv.

[3]  Olaf Booij,et al.  Exploiting Temporality for Semi-Supervised Video Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[4]  Ning Xu,et al.  Video Object Segmentation Using Space-Time Memory Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[5]  Ling Shao,et al.  RANet: Ranking Attention Network for Fast Video Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Zhiwu Lu,et al.  Every Frame Counts: Joint Learning of Video Segmentation and Optical Flow , 2019, AAAI.

[7]  Cristian Sminchisescu,et al.  Semantic Video Segmentation by Gated Recurrent Flow Propagation , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Yang Zhao,et al.  Deep High-Resolution Representation Learning for Visual Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Karan Sapra,et al.  Hierarchical Multi-Scale Attention for Semantic Segmentation , 2020, ArXiv.

[11]  Antonio J. Plaza,et al.  Image Segmentation Using Deep Learning: A Survey , 2021, IEEE transactions on pattern analysis and machine intelligence.

[12]  Andreas Geiger,et al.  Object scene flow for autonomous vehicles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Patrick Knöbelreiter,et al.  Frame-To-Frame Consistent Semantic Segmentation , 2020, ArXiv.

[14]  Stefan Roth,et al.  Joint Optical Flow and Temporally Consistent Semantic Segmentation , 2016, ECCV Workshops.

[15]  Kevin P. Murphy,et al.  Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[16]  Dahua Lin,et al.  Low-Latency Video Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17]  Kevin Gimpel,et al.  A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks , 2016, ICLR.

[18]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[19]  James M. Rehg,et al.  Joint Semantic Segmentation and 3D Reconstruction from Monocular Video , 2014, ECCV.

[20]  Sebastian Nowozin,et al.  Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift , 2019, NeurIPS.

[21]  Yujie Zhang,et al.  Triple attention network for video segmentation , 2020, Neurocomputing.

[22]  Chunhua Shen,et al.  Efficient Semantic Video Segmentation with Per-frame Inference , 2020, ECCV.

[23]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Hao Chen,et al.  Architecture Search of Dynamic Cells for Semantic Video Segmentation , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[25]  Kilian Q. Weinberger,et al.  On Calibration of Modern Neural Networks , 2017, ICML.

[26]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[27]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Nikos Komodakis,et al.  Wide Residual Networks , 2016, BMVC.

[29]  Rini Sherony,et al.  MIT DriveSeg (Manual) Dataset for Dynamic Driving Scene Segmentation , 2020 .

[30]  Bastian Leibe,et al.  Joint 2D-3D temporally consistent semantic segmentation of street scenes , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Roberto Cipolla,et al.  Segmentation and Recognition Using Structure from Motion Point Clouds , 2008, ECCV.

[32]  Vladlen Koltun,et al.  Feature Space Optimization for Semantic Video Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Peter Schlicht,et al.  Unsupervised Temporal Consistency Metric for Video Segmentation in Highly-Automated Driving , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[35]  Charles Blundell,et al.  Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles , 2016, NIPS.

[36]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Alex Kendall,et al.  What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , 2017, NIPS.