Joint Depth and Semantic Inference from a Single Image via Elastic Conditional Random Field

Depth estimation and regional semantic labeling from a single image have traditionally been treated as two separate problems. In this paper, we argue that the two tasks provide complementary information and can therefore be performed jointly, so that each reinforces the other in both accuracy and speed. In particular, we propose an Elastic Conditional Random Field (E-CRF) built on superpixel segmentations, which models the interdependency between depth and semantics so that the two estimates refine each other iteratively. Unlike traditional CRFs, E-CRF makes edges elastically hidden or emergent during inference, which enables fast Loopy Belief Propagation (LBP), while explicitly modeling the depth-label interdependency to achieve high inference accuracy. Moreover, a Structured Support Vector Machine (SSVM) is introduced to drastically speed up inference. Extensive evaluations on the Make3D and NYU benchmark datasets demonstrate that E-CRF significantly outperforms state-of-the-art techniques in precision while accelerating inference by 2-3 orders of magnitude.

Highlights

Efficient joint inference of depth estimation and region labeling from a single image.

Our first contribution is an efficient generative model called Elastic Conditional Random Field (E-CRF), which captures the interdependency between depth and labeling along with the spatial dependency among neighboring superpixels.

Our second contribution is to further accelerate the above LBP-based generative inference from a large-margin perspective by using a Structured Support Vector Machine (SSVM).
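The idea of depth and labels refining each other over a superpixel graph can be illustrated with a small sketch. This is not the E-CRF inference described above: it swaps Loopy Belief Propagation for plain iterated conditional modes (ICM), does not model the elastic hiding or emergence of edges, and does not involve the SSVM. The function name `joint_icm`, the discretized depth bins, the cost arrays, and the `smooth` weight are all assumptions made for illustration.

```python
import numpy as np

def joint_icm(depth_cost, label_cost, compat, edges, smooth=1.0, iters=5):
    """Alternately refine depth bins and labels on a superpixel graph (ICM).

    depth_cost: (N, D) unary cost of each depth bin per superpixel (assumed input)
    label_cost: (N, K) unary cost of each semantic label per superpixel (assumed input)
    compat:     (K, D) cost of pairing label k with depth bin d (assumed input)
    edges:      list of (i, j) pairs of neighbouring superpixels
    """
    n = depth_cost.shape[0]
    # Start from the independent, unary-only estimates.
    depth = depth_cost.argmin(axis=1)
    label = label_cost.argmin(axis=1)
    neighbours = [[] for _ in range(n)]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    bins = np.arange(depth_cost.shape[1])
    labels = np.arange(label_cost.shape[1])
    for _ in range(iters):
        changed = False
        for i in range(n):
            # Depth update: unary + compatibility with the current label
            # + absolute-difference smoothness with neighbouring depths.
            cost = depth_cost[i] + compat[label[i]]
            for j in neighbours[i]:
                cost = cost + smooth * np.abs(bins - depth[j])
            new_d = int(cost.argmin())
            # Label update: unary + compatibility with the new depth
            # + Potts smoothness with neighbouring labels.
            cost = label_cost[i] + compat[:, new_d]
            for j in neighbours[i]:
                cost = cost + smooth * (labels != label[j])
            new_l = int(cost.argmin())
            changed |= (new_d != depth[i]) or (new_l != label[i])
            depth[i], label[i] = new_d, new_l
        if not changed:
            break  # no assignment changed in a full sweep
    return depth, label
```

In this toy setting, a superpixel whose unary costs weakly prefer a label that is incompatible with its depth is pulled toward a compatible label, and a superpixel whose depth disagrees with its neighbours is pulled toward theirs. That is the kind of mutual refinement the abstract refers to, shown here in a much simpler form.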
