Overcoming Occlusion with Inverse Graphics

Scene understanding tasks such as the prediction of object pose, shape, appearance and illumination are hampered by the occlusions often found in images. We propose a vision-as-inverse-graphics approach that handles these occlusions by combining a graphics renderer with a robust generative model (GM). Since searching over scene factors to find the best match for an image is very inefficient, we use a recognition model (RM) trained on synthetic data to initialize the search. This paper addresses two issues. (i) We study how inference is affected by the degree of occlusion of the foreground object, and show that a robust GM, which includes an outlier model to account for occlusions, works significantly better than a non-robust model. (ii) Using a new dataset that includes background clutter and occlusions, we characterize the performance of the RM and the gains that can be made by refining its predictions with the GM. We find that the RM predicts pose and shape very well, but appearance and especially illumination less so. Accuracy on these latter two factors, however, can be clearly improved with the generative model.
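To illustrate why an outlier model helps under occlusion, here is a minimal sketch of a robust pixel likelihood. The abstract does not give the form of the robust GM, so everything in the sketch is an assumption made for illustration: the Gaussian inlier term, the uniform outlier term over intensities in [0, 1], the function names, and the parameters `sigma` and `outlier_prob`. The point is that the mixture bounds the penalty any single pixel can contribute, so occluded pixels cannot dominate the fit.

```python
import numpy as np


def gaussian_log_likelihood(observed, rendered, sigma=0.1):
    """Non-robust baseline: independent Gaussian noise on every pixel."""
    residual = observed - rendered
    log_norm = np.log(sigma * np.sqrt(2.0 * np.pi))
    return float(np.sum(-0.5 * (residual / sigma) ** 2 - log_norm))


def robust_log_likelihood(observed, rendered, sigma=0.1, outlier_prob=0.3):
    """Assumed robust form: Gaussian inlier mixed with a uniform outlier.

    Both images are float arrays of the same shape with values in [0, 1].
    The uniform density over [0, 1] is 1, so each pixel's likelihood is at
    least `outlier_prob`, however badly the render matches that pixel.
    """
    residual = observed - rendered
    inlier = np.exp(-0.5 * (residual / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    outlier = 1.0  # uniform density over the intensity range [0, 1]
    per_pixel = (1.0 - outlier_prob) * inlier + outlier_prob * outlier
    return float(np.sum(np.log(per_pixel)))


# Hypothetical example: a correct render of the object, with one corner of
# the observed image covered by a bright occluder.
rendered = np.full((8, 8), 0.5)
observed = rendered.copy()
observed[:4, :4] = 1.0  # occluded region

print("Gaussian log-likelihood:", gaussian_log_likelihood(observed, rendered))
print("Robust log-likelihood:  ", robust_log_likelihood(observed, rendered))
```

Under the Gaussian model the occluded pixels incur a large quadratic penalty even though the render is correct for the visible part of the object. Under the mixture they are explained by the outlier component at a bounded cost.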
