Multiclass semantic video segmentation with object-level active inference

We address the problem of integrating object reasoning with supervoxel labeling in multiclass semantic video segmentation. To this end, we first propose an object-augmented dense CRF in the spatio-temporal domain, which captures long-range dependencies between supervoxels and imposes consistency between object and supervoxel labels. We develop an efficient mean field inference algorithm that jointly infers the supervoxel labels, object activations, and their occlusion relations for a moderate number of object hypotheses. To scale up our method, we adopt an active inference strategy that adaptively selects object subgraphs in the object-augmented dense CRF. We formulate this selection problem as a Markov Decision Process and learn an approximately optimal policy based on a reward of accuracy improvement and a set of model and input features. We evaluate our method on three publicly available multiclass semantic video segmentation datasets and demonstrate superior efficiency and accuracy.
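To make the mean field step concrete, the following is a minimal sketch of a naive mean field update on a small fully connected CRF with Potts pairwise terms and a Gaussian kernel. It is an illustration under stated assumptions, not the inference algorithm of this work: it omits the object nodes, occlusion relations, and any efficient filtering, and the names `mean_field`, `map_labels`, `feats`, `weight`, and `sigma` are hypothetical.

```python
import math


def mean_field(unary, feats, weight=1.0, sigma=1.0, iters=10):
    """Naive O(n^2) mean field for a fully connected Potts CRF.

    unary[i][l] : cost (negative log-score) of label l at node i
    feats[i]    : scalar feature of node i (an assumed stand-in for
                  supervoxel appearance or position features)
    Returns per-node label distributions q[i][l].
    """
    n, num_labels = len(unary), len(unary[0])

    def softmax_neg(costs):
        # Normalised exp(-cost), shifted by the minimum for stability.
        m = min(costs)
        exps = [math.exp(-(c - m)) for c in costs]
        z = sum(exps)
        return [e / z for e in exps]

    # Gaussian kernel between every pair of distinct nodes.
    kernel = [
        [
            0.0 if i == j else
            weight * math.exp(-((feats[i] - feats[j]) ** 2) / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]

    q = [softmax_neg(u) for u in unary]  # initialise from the unaries
    for _ in range(iters):
        new_q = []
        for i in range(n):
            costs = []
            for l in range(num_labels):
                # Potts penalty: kernel-weighted expected mass of the
                # other nodes that disagree with label l.
                penalty = sum(kernel[i][j] * (1.0 - q[j][l]) for j in range(n))
                costs.append(unary[i][l] + penalty)
            new_q.append(softmax_neg(costs))
        q = new_q  # parallel update of all nodes
    return q


def map_labels(q):
    """Pick the most probable label for each node."""
    return [max(range(len(qi)), key=qi.__getitem__) for qi in q]


# Toy input: the middle node weakly prefers label 1, while its two
# neighbours in feature space strongly prefer label 0.
toy_unary = [[0.0, 2.0], [0.6, 0.4], [0.0, 2.0]]
toy_feats = [0.0, 0.1, 0.2]
smoothed = map_labels(mean_field(toy_unary, toy_feats, weight=1.0))
independent = map_labels(mean_field(toy_unary, toy_feats, weight=0.0))
```

In this toy setting the dense pairwise term pulls the weakly labeled middle node toward its neighbours' label, whereas setting `weight=0.0` reduces the model to independent per-node decisions. The quadratic pairwise sum is used here only for readability.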
