Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs

Predicting the depth (or surface normal) of a scene from single monocular color images is a challenging task. This paper tackles this challenging and essentially underdetermined problem by regression on deep convolutional neural network (DCNN) features, combined with a post-processing refining step using conditional random fields (CRF). Our framework works at two levels, super-pixel level and pixel level. First, we design a DCNN model to learn the mapping from multi-scale image patches to depth or surface normal values at the super-pixel level. Second, the estimated super-pixel depth or surface normal is refined to the pixel level by exploiting various potentials on the depth or surface normal map, which includes a data term, a smoothness term among super-pixels and an auto-regression term characterizing the local structure of the estimation map. The inference problem can be efficiently solved because it admits a closed-form solution. Experiments on the Make3D and NYU Depth V2 datasets show competitive results compared with recent state-of-the-art methods.

[1]  Marc Pollefeys,et al.  Pulling Things out of Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Meng Wang,et al.  2D-to-3D image conversion by learning depth from examples , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[3]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[4]  Marc Pollefeys,et al.  Discriminatively Trained Dense Surface Normal Estimation , 2014, ECCV.

[5]  Gabriel J. Brostow,et al.  Patch Based Synthesis for Single Depth Image Super-Resolution , 2012, ECCV.

[6]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[7]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[8]  Jan-Michael Frahm,et al.  Repetition-based dense single-view reconstruction , 2011, CVPR 2011.

[9]  Stephen Gould,et al.  Single image depth estimation from predicted semantic labels , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Ping-Sing Tsai,et al.  Shape from Shading: A Survey , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[12]  Sebastian Thrun,et al.  An Application of Markov Random Fields to Range Sensing , 2005, NIPS.

[13]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[14]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[15]  Dieter Fox,et al.  RGB-(D) scene labeling: Features and algorithms , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Alexei A. Efros,et al.  Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics , 2010, ECCV.

[17]  Ce Liu,et al.  Depth Extraction from Video Using Non-parametric Sampling , 2012, ECCV.

[18]  Ashutosh Saxena,et al.  3-D Depth Reconstruction from a Single Still Image , 2007, International Journal of Computer Vision.

[19]  Martial Hebert,et al.  Unfolding an Indoor Origami World , 2014, ECCV.

[20]  Stefano Soatto,et al.  Class segmentation and object localization with superpixel neighborhoods , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[21]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[23]  Xuming He,et al.  Discrete-Continuous Depth Estimation from a Single Image , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  David Sweeney,et al.  Learning to be a depth camera for close-range human capture and interaction , 2014, ACM Trans. Graph..

[25]  T. Lindeberg,et al.  Scale-Space Theory : A Basic Tool for Analysing Structures at Different Scales , 1994 .

[26]  Martial Hebert,et al.  Data-Driven 3D Primitives for Single Image Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[27]  Dani Lischinski,et al.  Colorization using optimization , 2004, ACM Trans. Graph..

[28]  Pascal Fua,et al.  SLIC Superpixels Compared to State-of-the-Art Superpixel Methods , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Ashutosh Saxena,et al.  Depth Estimation Using Monocular and Stereo Cues , 2007, IJCAI.