A Generic Model to Compose Vision Modules for Holistic Scene Understanding

Holistic scene understanding involves many vision tasks, such as depth estimation, scene categorization, and event categorization. Each task explores a different aspect of the scene, but the tasks are related because they all describe attributes of the same scene. The intuition is that one task can provide meaningful attributes that aid the learning of another. In this work, we propose a generic model, together with learning and inference techniques, that connects different vision tasks in the form of a 2-layer cascade. Our model treats the first layer as a hidden layer whose latent variables are inferred through feedback from the second layer. This feedback mechanism allows the first-layer classifiers to focus on the more important image modes and draws their outputs towards "attributes" rather than the original "labels". Our model also automatically discovers sparse connections between the attributes learned on the first layer and the target task on the second layer. Note that in our model the same vision task can act both as an attribute learner and as a target task, depending on the layer on which it is placed. In extensive experiments, we show that the same proposed model improves performance on all the tasks we consider: single-image depth estimation, scene categorization, saliency detection, and event categorization.
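
To make the 2-layer structure concrete, the following is a minimal sketch and not the authors' implementation. It assumes linear first-layer modules, a squared-error second layer, and an L1 penalty solved by iterative soft-thresholding (ISTA) as one possible way to obtain sparse connections; the feedback step that re-estimates the first layer is omitted. All names (`soft_threshold`, `fit_sparse_second_layer`, `cascade_predict`, `lam`) are hypothetical.

```python
import numpy as np


def soft_threshold(w, t):
    """Shrink each entry of w towards zero by t (proximal operator of the L1 norm)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_sparse_second_layer(A, y, lam=0.1, n_iter=500):
    """Fit second-layer weights on first-layer outputs A (n x k) with ISTA.

    Minimizes (1 / 2n) * ||A w - y||^2 + lam * ||w||_1, so attributes that do
    not help the target task receive a weight of exactly zero.
    """
    n, k = A.shape
    w = np.zeros(k)
    # Step size 1/L, where L = ||A||_2^2 / n bounds the gradient's Lipschitz constant.
    step = n / (np.linalg.norm(A, 2) ** 2)
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w


def cascade_predict(X, first_layer_weights, w2):
    """Forward pass: features -> first-layer outputs (attributes) -> target."""
    attributes = X @ first_layer_weights  # assumed linear first-layer modules
    return attributes @ w2
```

In this sketch the sparsity pattern of `w2` plays the role of the discovered connections: a zero entry means the corresponding first-layer output is not used by the target task.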
