Learning depth from a single image using visual-depth words

Estimating depth from a single monocular image is a fundamental problem in computer vision. Traditional methods for such estimation usually require complicated and sometimes labor-intensive processing. In this paper, we propose a new perspective on this problem and suggest a gradient-domain learning framework that is much simpler and more efficient. Inspired by the observation that image edges and depth discontinuities co-occur frequently in natural scenes, we learn the relationship between local appearance features and the corresponding depth gradients by applying K-means clustering in the image feature space. We then encode each cluster centroid with its associated depth gradients, which defines visual-depth words that model the image-depth relationship well. This enables one to estimate the scene depth of an arbitrary image by simply selecting proper depth gradients from a compact dictionary of visual-depth words, followed by a Poisson surface reconstruction. Experimental results demonstrate that the proposed gradient-domain approach outperforms state-of-the-art methods both qualitatively and quantitatively, and that it generalizes to (unseen) scene categories that are not used for training.
