Depth Analogy: Data-Driven Approach for Single Image Depth Estimation Using Gradient Samples

Inferring scene depth from a single monocular image is a highly ill-posed problem in computer vision. This paper presents a new gradient-domain approach, called depth analogy, that synthesizes a target depth field by analogy, given a collection of RGB-D image pairs as training data. Specifically, the proposed method employs a non-parametric learning process that creates an analogous depth field by sampling reliable depth gradients, using visual correspondences established on the training image pairs. Unlike existing data-driven approaches that directly select depth values from training data, our framework transfers depth gradients as reconstruction cues, which are then integrated via Poisson reconstruction. The performance of most conventional approaches relies heavily on the training RGB-D data, and this dependency severely degrades the quality of the reconstructed depth maps when the desired depth distribution of an input image differs substantially from that of the training data, e.g., outdoor versus indoor scenes. Our key observation is that depth gradients are less sensitive to scene characteristics than depth values and therefore provide better cues for depth recovery. Thus, our gradient-domain approach can support a wide variety of training range datasets that involve substantial appearance and geometric variations. The experimental results demonstrate that our gradient-domain approach outperforms existing data-driven approaches that work directly in the depth domain, even when only uncorrelated training datasets are available.
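The integration step can be illustrated with a small sketch. The code below is not the paper's implementation: it is a minimal least-squares integration of a gradient field, solved with Gauss-Seidel sweeps on the discrete Poisson equation, under the assumptions of forward-difference gradients, a grid of at least 2 x 2 pixels, and a single pinned pixel to fix the unknown depth offset. The names `integrate_gradients`, `gradients`, `anchor`, and `sweeps` are hypothetical and chosen only for illustration.

```python
def gradients(depth):
    """Forward-difference gradients of a depth grid (list of equal-length rows)."""
    h, w = len(depth), len(depth[0])
    gx = [[depth[i][j + 1] - depth[i][j] for j in range(w - 1)] for i in range(h)]
    gy = [[depth[i + 1][j] - depth[i][j] for j in range(w)] for i in range(h - 1)]
    return gx, gy


def integrate_gradients(gx, gy, anchor=0.0, sweeps=2000):
    """Recover a depth grid whose forward differences best match (gx, gy).

    gx[i][j] is the target for d[i][j+1] - d[i][j]  (h rows, w-1 columns).
    gy[i][j] is the target for d[i+1][j] - d[i][j]  (h-1 rows, w columns).
    Pixel (0, 0) is pinned to `anchor` (an assumption of this sketch),
    because gradients determine depth only up to a constant offset.
    """
    h, w = len(gx), len(gy[0])
    d = [[anchor] * w for _ in range(h)]
    for _ in range(sweeps):
        for i in range(h):
            for j in range(w):
                if i == 0 and j == 0:
                    continue  # keep the pinned pixel fixed
                # Each neighbour predicts this pixel's depth via its gradient;
                # averaging the predictions is one Gauss-Seidel update of the
                # discrete Poisson equation with natural boundary handling.
                est = []
                if j > 0:
                    est.append(d[i][j - 1] + gx[i][j - 1])
                if j < w - 1:
                    est.append(d[i][j + 1] - gx[i][j])
                if i > 0:
                    est.append(d[i - 1][j] + gy[i - 1][j])
                if i < h - 1:
                    est.append(d[i + 1][j] - gy[i][j])
                d[i][j] = sum(est) / len(est)
    return d


# Two toy depth grids with the same shape but different absolute depth range.
near = [[1.0, 2.0, 4.0], [1.5, 2.5, 4.5], [3.0, 3.0, 5.0]]
far = [[v + 10.0 for v in row] for row in near]

gx, gy = gradients(near)
recovered = integrate_gradients(gx, gy, anchor=near[0][0])
```

In this toy setting, `near` and `far` have identical gradient fields although their depth values differ by a constant, which loosely mirrors the observation that gradients depend less on the overall depth range of a scene. A practical system would likely replace the plain sweeps with a sparse or multigrid solver.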
