Paper Information - Search-Based Depth Estimation via Coupled Dictionary Learning with Large-Margin Structure Inference

Depth estimation from a single image is an emerging topic in computer vision and beyond. Existing works typically train a depth regressor from visual appearance. However, even the state-of-the-art performance of these schemes remains far from satisfactory, mainly because of over-fitting and under-fitting in regressor training. In this paper, we offer a different data-driven paradigm that formulates single-image depth estimation from a search-based perspective. In particular, we estimate the depth of local patches via a novel cross-modality retrieval scheme, which searches a dataset of 2D-3D mappings for the 3D patches whose structure/appearance is similar to the 2D query. To that end, we propose a coupled dictionary learning formulation that links the 2D query with the 3D patches through their reconstruction coefficients, which capture the cross-modality similarity and yield a rough local depth estimate. In addition, spatial-context consistency is introduced to refine the local depth estimates using a Conditional Random Field. We demonstrate the efficacy of the proposed method by comparing it with state-of-the-art approaches on popular public datasets such as Make3D and NYUv2, on which significant performance gains are reported.
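
The retrieval step described above can be illustrated with a minimal NumPy sketch. It is not the paper's formulation. It assumes the two dictionaries (here called `D2` for 2D appearance and `D3` for 3D depth patches) have already been learned so that paired patches share similar codes, and it swaps sparse coding for plain ridge regression to stay short. The function names, the cosine-similarity ranking, and the similarity-weighted averaging of the top matches are all hypothetical choices made for illustration. The Conditional Random Field refinement is not covered.

```python
import numpy as np


def ridge_code(D, X, lam=1e-3):
    """Code each row of X over dictionary D (one atom per row).

    Solves min_A ||X - A D||^2 + lam * ||A||^2 in closed form.
    Assumption: ridge regression stands in for sparse coding here.
    """
    K = D.shape[0]
    G = D @ D.T + lam * np.eye(K)          # K x K regularised Gram matrix
    return np.linalg.solve(G, D @ X.T).T   # one K-dimensional code per row of X


def retrieve_depth(query_2d, D2, codes_3d, depth_patches, k=3):
    """Rank stored 3D patches by code similarity to a 2D query patch.

    query_2d      : (d2,) appearance feature of the query patch
    D2            : (K, d2) appearance dictionary (assumed pre-learned)
    codes_3d      : (n, K) codes of the stored depth patches over D3
    depth_patches : (n, d3) stored depth patches
    Returns the indices of the top-k matches and a rough depth estimate.
    """
    q = ridge_code(D2, query_2d[None, :])[0]
    norms = np.linalg.norm(codes_3d, axis=1) * np.linalg.norm(q) + 1e-12
    sims = (codes_3d @ q) / norms          # cosine similarity in code space
    top = np.argsort(-sims)[:k]
    w = np.maximum(sims[top], 0.0)         # ignore negatively correlated matches
    w = w / (w.sum() + 1e-12)
    return top, w @ depth_patches[top]     # similarity-weighted average depth


if __name__ == "__main__":
    # Synthetic paired data: both modalities are generated from shared codes A.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 8))
    D2 = rng.normal(size=(8, 32))
    D3 = rng.normal(size=(8, 16))
    X2, X3 = A @ D2, A @ D3
    codes_3d = ridge_code(D3, X3)
    top, est = retrieve_depth(X2[7], D2, codes_3d, X3, k=3)
    print("top matches:", top)
    print("error vs. paired depth patch:", np.linalg.norm(est - X3[7]))
```

Because both modalities are compared in the shared code space rather than in raw feature space, the 2D query never has to be matched against depth values directly; that is the property the coupled dictionaries are meant to provide.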
