Hallucinated Humans as the Hidden Context for Labeling 3D Scenes

For scene understanding, one popular approach has been to model object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, a monitor and a keyboard are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships) and to use it as a cue for labeling scenes. We present the Infinite Factored Topic Model (IFTM), which treats a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to parsimoniously hallucinate the possible configurations of humans in the scene. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human-object relationships. We then test our algorithm on the task of attribute and object labeling in 3D scenes and show consistent improvements over the state of the art.
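To make the factored generative idea concrete, the sketch below is a minimal, illustrative toy version, not the authors' actual IFTM formulation or inference procedure: the object types, offsets, noise scales, and concentration parameter are all invented for illustration. It hallucinates human positions with a Chinese-restaurant-process prior (so a small number of humans is preferred) and places objects relative to those humans through simple Gaussian human-object relationships.

```python
# Illustrative sketch only: a toy two-factor generative process in the spirit of
# "hallucinated humans + human-object relationships". All distributions, object
# types, and parameters below are hypothetical, not the paper's IFTM model.

import numpy as np

rng = np.random.default_rng(0)

ALPHA = 1.0   # CRP concentration: smaller values hallucinate fewer humans (parsimony)
ROOM = 5.0    # assume a ROOM x ROOM square room (metres), for illustration

# Hypothetical human-object relationships: mean offset from a human and spread.
RELATIONS = {
    "monitor":  {"offset": np.array([0.6, 0.0]), "std": 0.10},
    "keyboard": {"offset": np.array([0.3, 0.0]), "std": 0.05},
    "mouse":    {"offset": np.array([0.3, 0.2]), "std": 0.05},
}

def sample_scene(n_objects=6):
    """Generate one scene: each object is assigned to a (possibly new) hallucinated
    human via the Chinese restaurant process, then placed relative to that human."""
    humans = []   # hallucinated human positions
    counts = []   # how many objects each human already explains
    scene = []
    for _ in range(n_objects):
        # CRP assignment: reuse an existing human with prob ~ counts, or add a new one.
        weights = np.array(counts + [ALPHA], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(humans):                     # "new table": hallucinate a new human
            humans.append(rng.uniform(0.0, ROOM, size=2))
            counts.append(0)
        counts[k] += 1
        # Pick an object type and place it via its human-object relationship.
        name = rng.choice(list(RELATIONS))
        rel = RELATIONS[name]
        pos = humans[k] + rel["offset"] + rng.normal(0.0, rel["std"], size=2)
        scene.append((name, pos, k))
    return humans, scene

humans, scene = sample_scene()
print(f"hallucinated {len(humans)} human configuration(s)")
for name, pos, k in scene:
    print(f"{name:9s} at ({pos[0]:.2f}, {pos[1]:.2f})  <- human {k}")
```

In this toy version the concentration parameter plays the role of the parsimony prior: because the CRP favors reusing existing "tables", correlated objects such as a monitor and keyboard tend to be explained by the same hallucinated human rather than by a direct object-object term.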
