DeepContext: Context-Encoding Neural Pathways for 3D Holistic Scene Understanding

3D context has been shown to be extremely important for scene understanding, yet very little research has been done on integrating context information with deep neural network architectures. This paper presents an approach to embed 3D context into the topology of a neural network trained to perform holistic scene understanding. Given a depth image depicting a 3D scene, our network aligns the observed scene with a predefined 3D scene template, and then reasons about the existence and location of each object within the scene template. In doing so, our model recognizes multiple objects in a single forward pass of a 3D convolutional neural network, capturing both global scene and local object information simultaneously. To create training data for this 3D network, we generate partially synthetic depth images which are rendered by replacing real objects with a repository of CAD models of the same object category1. Extensive experiments demonstrate the effectiveness of our algorithm compared to the state of the art.

[1]  M. Potter Short-term conceptual memory for pictures. , 1976, Journal of experimental psychology. Human learning and memory.

[2]  Zhuowen Tu,et al.  Image Parsing: Unifying Segmentation, Detection, and Recognition , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[3]  Antonio Torralba,et al.  Contextual Priming for Object Detection , 2003, International Journal of Computer Vision.

[4]  Feng Han,et al.  Bottom-up/top-down image parsing by attribute graph grammar , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[5]  Antonio Torralba,et al.  Describing Visual Scenes using Transformed Dirichlet Processes , 2005, NIPS.

[6]  Antonio Torralba,et al.  Learning hierarchical models of scenes, objects, and parts , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[7]  Antonio Torralba,et al.  Depth from Familiar Objects: A Hierarchical Model for 3D Scenes , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[8]  Alexei A. Efros,et al.  Putting Objects in Perspective , 2006, CVPR.

[9]  Antonio Torralba,et al.  Describing Visual Scenes Using Transformed Objects and Parts , 2008, International Journal of Computer Vision.

[10]  Andrea Vedaldi,et al.  Objects in Context , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[11]  A. Torralba,et al.  The role of context in object recognition , 2007, Trends in Cognitive Sciences.

[12]  Ashutosh Saxena,et al.  Cascaded Classification Models: Combining Models for Holistic Scene Understanding , 2008, NIPS.

[13]  Zhuowen Tu,et al.  Auto-context and its application to high-level vision tasks , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Michael I. Jordan,et al.  Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes , 2008, NIPS.

[15]  Dirk B. Walther,et al.  Natural Scene Categories Revealed in Distributed Patterns of Activity in the Human Brain , 2009, The Journal of Neuroscience.

[16]  Feng Han,et al.  Bottom-Up/Top-Down Image Parsing with Attribute Grammar , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Charless C. Fowlkes,et al.  Discriminative models for multi-class object layout , 2009, ICCV.

[18]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Li Fei-Fei,et al.  Towards total scene understanding: Classification, annotation and segmentation in an automatic framework , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Pushmeet Kohli,et al.  Graph Cut Based Inference with Co-occurrence Statistics , 2010, ECCV.

[21]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[22]  Antonio Torralba,et al.  Exploiting hierarchical context on a large database of object categories , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[23]  Song-Chun Zhu,et al.  A Numerical Study of the Bottom-Up and Top-Down Inference Processes in And-Or Graphs , 2011, International Journal of Computer Vision.

[24]  Charles Kemp,et al.  How to Grow a Mind: Statistics, Structure, and Abstraction , 2011, Science.

[25]  Chi-Keung Tang,et al.  Make it home: automatic optimization of furniture arrangement , 2011, SIGGRAPH 2011.

[26]  Andrew W. Fitzgibbon,et al.  KinectFusion: Real-time dense surface mapping and tracking , 2011, 2011 10th IEEE International Symposium on Mixed and Augmented Reality.

[27]  Tsuhan Chen,et al.  Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models , 2010, NIPS.

[28]  Antonio Torralba,et al.  Context models and out-of-context objects , 2012, Pattern Recognit. Lett..

[29]  Antonio Torralba,et al.  A Tree-Based Context Model for Object Recognition , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Jessica B. Hamrick,et al.  Simulation as an engine of physical scene understanding , 2013, Proceedings of the National Academy of Sciences.

[31]  Joshua B. Tenenbaum,et al.  Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs , 2013, NIPS.

[32]  Jianxiong Xiao,et al.  A Linear Approach to Matching Cuboids in RGBD Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Sanja Fidler,et al.  Holistic Scene Understanding for 3D Object Detection with RGBD Cameras , 2013, 2013 IEEE International Conference on Computer Vision.

[34]  Silvio Savarese,et al.  Understanding Indoor Scenes Using 3D Geometric Phrases , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Song-Chun Zhu,et al.  Scene Parsing by Integrating Function, Geometry and Appearance Models , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  S. Savarese,et al.  Supplemental Material : Understanding Indoor Scenes using 3 D Geometric Phrases , 2013 .

[37]  Niloy J. Mitra,et al.  Creating consistent scene graphs using a probabilistic grammar , 2014, ACM Trans. Graph..

[38]  Sanja Fidler,et al.  The Role of Context for Object Detection and Semantic Segmentation in the Wild , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  A. Khosla,et al.  3D ShapeNets for 2.5D Object Recognition and Next-Best-View Prediction , 2014, ArXiv.

[40]  Song-Chun Zhu,et al.  Integrating Function , Geometry , Appearance for Scene Parsing , 2014 .

[41]  Yinda Zhang,et al.  PanoContext: A Whole-Room 3D Context Model for Panoramic Scene Understanding , 2014, ECCV.

[42]  Jianxiong Xiao,et al.  Sliding Shapes for 3D Object Detection in Depth Images , 2014, ECCV.

[43]  Sebastian Scherer,et al.  VoxNet: A 3D Convolutional Neural Network for real-time object recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[44]  Leonidas J. Guibas,et al.  Joint embeddings of shapes and images via CNN image purification , 2015, ACM Trans. Graph..

[45]  Joshua B. Tenenbaum,et al.  Deep Convolutional Inverse Graphics Network , 2015, NIPS.

[46]  Leonidas J. Guibas,et al.  Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[47]  Jianxiong Xiao,et al.  3D ShapeNets: A deep representation for volumetric shapes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[49]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[50]  Jianxiong Xiao,et al.  SUN RGB-D: A RGB-D scene understanding benchmark suite , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Roberto Cipolla,et al.  SceneNet: Understanding Real World Indoor Scenes With Synthetic Data , 2015, ArXiv.

[52]  Joshua B. Tenenbaum,et al.  Picture: A probabilistic programming language for scene perception , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[54]  Erik B. Sudderth,et al.  Three-Dimensional Object Detection and Layout Prediction Using Clouds of Oriented Gradients , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Jianxiong Xiao,et al.  Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.