Budget-Aware Deep Semantic Video Segmentation

In this work, we study a poorly understood trade-off between accuracy and runtime costs for deep semantic video segmentation. While recent work has demonstrated advantages of learning to speed-up deep activity detection, it is not clear if similar advantages will hold for our very different segmentation loss function, which is defined over individual pixels across the frames. In deep video segmentation, the most time consuming step represents the application of a CNN to every frame for assigning class labels to every pixel, typically taking 6-9 times of the video footage. This motivates our new budget-aware framework that learns to optimally select a small subset of frames for pixelwise labeling by a CNN, and then efficiently interpolates the obtained segmentations to yet unprocessed frames. This interpolation may use either a simple optical-flow guided mapping of pixel labels, or another significantly less complex and thus faster CNN. We formalize the frame selection as a Markov Decision Process, and specify a Long Short-Term Memory (LSTM) network to model a policy for selecting the frames. For training the LSTM, we develop a policy-gradient reinforcement-learning approach for approximating the gradient of our non-decomposable and non-differentiable objective. Evaluation on two benchmark video datasets show that our new framework is able to significantly reduce computation time, and maintain competitive video segmentation accuracy under varying budgets.

[1]  Olga Veksler,et al.  Fast approximate energy minimization via graph cuts , 2001, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[2]  Olga Veksler,et al.  Fast Approximate Energy Minimization via Graph Cuts , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  R. J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[4]  Roberto Cipolla,et al.  Assisted Video Object Labeling By Joint Tracking of Regions and Keypoints , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[5]  Roberto Cipolla,et al.  Segmentation and Recognition Using Structure from Motion Point Clouds , 2008, ECCV.

[6]  Bernt Schiele,et al.  A Dynamic Conditional Random Field Model for Joint Labeling of Object and Scene Classes , 2008, ECCV.

[7]  Antonio Torralba,et al.  Nonparametric scene parsing: Label transfer via dense scene alignment , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Jürgen Schmidhuber,et al.  Recurrent policy gradients , 2010, Log. J. IGPL.

[9]  Ignas Budvytis,et al.  Label propagation in complex video sequences using semi-supervised learning , 2010, BMVC.

[10]  Jason J. Corso,et al.  Propagating multi-class pixel labels throughout video frames , 2010, 2010 Western New York Image Processing Workshop.

[11]  Jason J. Corso,et al.  Temporally consistent multi-class video-object segmentation with the Video Graph-Shifts algorithm , 2011, 2011 IEEE Workshop on Applications of Computer Vision (WACV).

[12]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[13]  Kristen Grauman,et al.  Active Frame Selection for Label Propagation in Videos , 2012, ECCV.

[14]  Tsuhan Chen,et al.  Efficient inference for fully-connected CRFs with stationarity , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Mohamed R. Amer,et al.  Cost-Sensitive Top-Down/Bottom-Up Inference for Multiscale Activity Recognition , 2012, ECCV.

[17]  Luc Van Gool,et al.  Active MAP Inference in CRFs for Efficient Semantic Segmentation , 2013, 2013 IEEE International Conference on Computer Vision.

[18]  Brian Taylor,et al.  Semantic Video Segmentation from Occlusion Relations within a Convex Optimization Framework , 2013, EMMCVPR.

[19]  M. Hebert,et al.  Efficient temporal consistency for streaming video scene analysis , 2013, 2013 IEEE International Conference on Robotics and Automation.

[20]  Martial Hebert,et al.  SpeedMachines: Anytime Structured Prediction , 2013, ArXiv.

[21]  Song-Chun Zhu,et al.  Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection , 2015, 2013 IEEE International Conference on Computer Vision.

[22]  Ronan Collobert,et al.  Recurrent Convolutional Neural Networks for Scene Parsing , 2013, ArXiv.

[23]  Ben Taskar,et al.  Learning Adaptive Value of Information for Structured Prediction , 2013, NIPS.

[24]  René Vidal,et al.  Coarse-to-Fine Semantic Video Segmentation Using Supervoxel Trees , 2013, 2013 IEEE International Conference on Computer Vision.

[25]  Ben Taskar,et al.  Dynamic Structured Model Selection , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Gerhard Reinelt,et al.  Towards Efficient and Exact MAP-Inference for Large Scale Discrete Computer Vision Problems via Combinatorial Optimization , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Benjamin M. Marlin,et al.  The Shape-Time Random Field for Semantic Video Labeling , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Ronan Collobert,et al.  Recurrent Convolutional Neural Networks for Scene Labeling , 2014, ICML.

[29]  Xiao Liu,et al.  Weakly Supervised Multiclass Video Segmentation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Trevor Darrell,et al.  Anytime Recognition of Objects and Scenes , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Alex Graves,et al.  Recurrent Models of Visual Attention , 2014, NIPS.

[32]  Xuming He,et al.  Multi-class Semantic Video Segmentation with Exemplar-Based Object Reasoning , 2015, 2015 IEEE Winter Conference on Applications of Computer Vision.

[33]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[34]  Jitendra Malik,et al.  Hypercolumns for object segmentation and fine-grained localization , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Trevor Darrell,et al.  Fully convolutional networks for semantic segmentation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[37]  Damien Ernst,et al.  How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies , 2015, ArXiv.

[38]  Xuming He,et al.  Multiclass semantic video segmentation with object-level active inference , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[40]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[41]  한보형,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015 .

[42]  Xiaolin Hu,et al.  Convolutional Neural Networks with Intra-Layer Recurrent Connections for Scene Labeling , 2015, NIPS.

[43]  Li Fei-Fei,et al.  End-to-End Learning of Action Detection from Frame Glimpses in Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[45]  Yi Yang,et al.  Attention to Scale: Scale-Aware Semantic Image Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Trevor Darrell,et al.  Clockwork Convnets for Video Semantic Segmentation , 2016, ECCV Workshops.

[47]  Sinisa Todorovic,et al.  Recurrent Temporal Deep Field for Semantic Video Labeling , 2016, ECCV.

[48]  Alan Fern,et al.  Approximate Policy Iteration for Budgeted Semantic Video Segmentation , 2016, ArXiv.

[49]  Xuming He,et al.  Learning Dynamic Hierarchical Models for Anytime Scene Labeling , 2016, ECCV.

[50]  Roberto Cipolla,et al.  Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding , 2015, BMVC.

[51]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.