Single image depth estimation based on convolutional neural network and sparse connected conditional random field

Abstract. Deep convolutional neural networks (DCNNs) have attracted significant interest in the computer vision community in the recent years and have exhibited high performance in resolving many computer vision problems, such as image classification. We address the pixel-level depth prediction from a single image by combining DCNN and sparse connected conditional random field (CRF). Owing to the invariance properties of DCNNs that make them suitable for high-level tasks, their outputs are generally not localized enough for detailed pixel-level regression. A multiscale DCNN and sparse connected CRF are combined to overcome this localization weakness. We have evaluated our framework using the well-known NYU V2 depth dataset, and the results show that the proposed method can improve the depth prediction accuracy both qualitatively and quantitatively, as compared to previous works. This finding shows the potential use of the proposed method in three-dimensional (3-D) modeling or 3-D video production from the given two-dimensional (2-D) images or 2-D videos.

[1]  Stephen Lin,et al.  Coded aperture pairs for depth from defocus , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[2]  Pascal Fua,et al.  SLIC Superpixels Compared to State-of-the-Art Superpixel Methods , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Pushmeet Kohli,et al.  Robust Higher Order Potentials for Enforcing Label Consistency , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Roberto Cipolla,et al.  Multiview Photometric Stereo , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Stephen Gould,et al.  Single image depth estimation from predicted semantic labels , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[7]  Andrew Blake,et al.  "GrabCut" , 2004, ACM Trans. Graph..

[8]  Berthold K. P. Horn Obtaining shape from shading information , 1989 .

[9]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[10]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[11]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[12]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[13]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[14]  Robert J. Woodham,et al.  Photometric method for determining surface orientation from multiple images , 1980 .

[15]  Diego F. Nehab,et al.  Efficiently combining positions and normals for precise 3D geometry , 2005, SIGGRAPH 2005.

[16]  Sam Kwong,et al.  Efficient Motion and Disparity Estimation Optimization for Low Complexity Multiview Video Coding , 2015, IEEE Transactions on Broadcasting.

[17]  S. Nayar,et al.  What are good apertures for defocus deblurring? , 2009, 2009 IEEE International Conference on Computational Photography (ICCP).

[18]  Ashutosh Saxena,et al.  3-D Depth Reconstruction from a Single Still Image , 2007, International Journal of Computer Vision.

[19]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[20]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[21]  Carl E. Rasmussen,et al.  Learning Depth from Stereo , 2004, DAGM-Symposium.

[22]  Hujun Bao,et al.  Consistent Depth Maps Recovery from a Video Sequence , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Richard Szeliski,et al.  A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms , 2001, International Journal of Computer Vision.

[24]  Frédo Durand,et al.  Image and depth from a conventional camera with a coded aperture , 2007, SIGGRAPH 2007.

[25]  Guosheng Lin,et al.  Deep convolutional neural fields for depth estimation from a single image , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).