Bayesian depth estimation from monocular natural images.

Estimating an accurate and naturalistic dense depth map from a single monocular photographic image is a difficult problem. Nevertheless, human observers have little difficulty understanding the depth structure implied by photographs. Two-dimensional (2D) images of the real-world environment contain significant statistical information regarding the three-dimensional (3D) structure of the world that the visual system likely exploits to compute perceived depth, monocularly as well as binocularly. Toward understanding how this might be accomplished, we propose a Bayesian model of monocular depth computation that recovers detailed 3D scene structures by extracting reliable, robust, depth-sensitive statistical features from single natural images. These features are derived using well-accepted univariate natural scene statistics (NSS) models and recent bivariate/correlation NSS models that describe the relationships between 2D photographic images and their associated depth maps. This is accomplished by building a dictionary of canonical local depth patterns from which NSS features are extracted as prior information. The dictionary is used to create a multivariate Gaussian mixture (MGM) likelihood model that associates local image features with depth patterns. A simple Bayesian predictor is then used to form spatial depth estimates. The depth results produced by the model, despite its simplicity, correlate well with ground-truth depths measured by a current-generation terrestrial light detection and ranging (LIDAR) scanner. Such a strong form of statistical depth information could be used by the visual system when creating overall estimated depth maps incorporating stereopsis, accommodation, and other depth cues. Indeed, even in isolation, the Bayesian predictor delivers depth estimates that are competitive with state-of-the-art "computer vision" methods that utilize highly engineered image features and sophisticated machine learning algorithms.
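
The pipeline outlined above (a dictionary of depth patterns, a likelihood linking image features to patterns, and a Bayesian predictor) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the dictionary is built by k-means clustering of depth patches, that each pattern's likelihood is a single Gaussian rather than the full MGM, and that the features are generic vectors rather than NSS features. All function names and the toy data are hypothetical.

```python
import numpy as np

def kmeans(data, k, iters=50, seed=0):
    """Plain Lloyd iterations; returns cluster centers and point labels."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Final assignment so labels match the returned centers.
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)

def fit_model(features, depth_patches, k, reg=1e-3):
    """Dictionary of depth patterns plus a prior and Gaussian likelihood per pattern."""
    patterns, labels = kmeans(depth_patches, k)
    d = features.shape[1]
    priors, means, covs = [], [], []
    for j in range(k):
        f = features[labels == j]
        priors.append(len(f) / len(features))
        if len(f) < 2:
            f = features  # assumption: fall back to global statistics
        means.append(f.mean(axis=0))
        # Small ridge term keeps the covariance invertible.
        covs.append(np.cov(f, rowvar=False) + reg * np.eye(d))
    return {"patterns": patterns, "priors": np.array(priors),
            "means": np.array(means), "covs": np.array(covs)}

def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet + maha)

def predict_depth(feature, model):
    """MAP depth pattern and the posterior over all dictionary patterns."""
    log_post = np.array([
        np.log(max(p, 1e-12)) + log_gaussian(feature, m, c)
        for p, m, c in zip(model["priors"], model["means"], model["covs"])
    ])
    post = np.exp(log_post - log_post.max())  # subtract max for stability
    post /= post.sum()
    return model["patterns"][post.argmax()], post

def make_toy_data(n=200, seed=1):
    """Synthetic stand-in: 'near' and 'far' flat 3x3 depth patches with 2-D features."""
    rng = np.random.default_rng(seed)
    near_d = 1.0 + 0.05 * rng.standard_normal((n, 9))
    far_d = 5.0 + 0.05 * rng.standard_normal((n, 9))
    near_f = 0.0 + 0.5 * rng.standard_normal((n, 2))
    far_f = 4.0 + 0.5 * rng.standard_normal((n, 2))
    return np.vstack([near_f, far_f]), np.vstack([near_d, far_d])

if __name__ == "__main__":
    feats, depths = make_toy_data()
    model = fit_model(feats, depths, k=2)
    pattern, posterior = predict_depth(np.array([0.0, 0.0]), model)
    print(pattern.reshape(3, 3), posterior)
```

Returning the full posterior alongside the MAP pattern leaves room for a posterior-weighted average of dictionary patterns, which is one plausible way to obtain smoother estimates. Whether the paper's predictor uses a MAP or an averaging rule is not stated in the abstract.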
