Modeling spatial layout for scene image understanding via a novel multiscale sum-product network

A new deep architecture, MSPN, is proposed for image segmentation. Multiscale unary potentials are used to model image spatial layouts. A superpixel-based refinement method is used to improve the parsing results.

Semantic image segmentation is challenging because of the large intra-class variations and complex spatial layouts in natural scenes. This paper addresses the problem with a new deep architecture, the multiscale sum-product network (MSPN), which takes multiscale unary potentials as inputs and models the spatial layout of image content hierarchically. That is, MSPN models the joint distribution of multiscale unary potentials and object classes, rather than the single-scale unary potentials used in common settings. MSPN also characterizes scene spatial layouts in a fine-to-coarse manner to enforce labeling consistency. Unary potentials at different scales thus help overcome the semantic ambiguities caused by evaluating only single local regions, while long-range spatial correlations further refine the labeling. In addition, higher-order priors can impose constraints among labels. In this way, multiscale unary potentials, long-range spatial correlations, and higher-order priors are modeled within a single unified framework. We conduct experiments on two challenging benchmarks, the MSRC-21 dataset and the SIFT Flow dataset. The results demonstrate the superior performance of our method compared with previous graphical models for scene image understanding.
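To make the idea of feeding multiscale unary potentials into a sum-product network concrete, the following is a minimal toy sketch, not the MSPN architecture itself. It assumes a single image region, one product node per class that multiplies that class's potentials across scales, and one root sum node with hypothetical class weights. The function name `toy_spn_posterior` and the input values are illustrative assumptions.

```python
from math import prod


def toy_spn_posterior(unaries, class_weights):
    """Evaluate a two-layer toy sum-product network for one region.

    unaries: list of per-scale lists; unaries[s][k] is the nonnegative
        unary potential of class k at scale s (hypothetical inputs).
    class_weights: nonnegative weights of the root sum node, one per class.
    Returns the root value and the posterior over classes.
    """
    num_classes = len(class_weights)
    # Product nodes: combine the evidence for each class across all scales.
    products = [prod(scale[k] for scale in unaries) for k in range(num_classes)]
    # Root sum node: weighted mixture over the per-class product nodes.
    root = sum(w * p for w, p in zip(class_weights, products))
    # Posterior over classes: each branch's share of the root value.
    posterior = [w * p / root for w, p in zip(class_weights, products)]
    return root, posterior


# Two scales, two classes: the fine scale is ambiguous, the coarse scale is not.
unaries = [[0.5, 0.5], [0.2, 0.8]]
weights = [0.5, 0.5]
root, posterior = toy_spn_posterior(unaries, weights)
print(root, posterior)
```

In this toy setting the ambiguous fine-scale potentials are resolved by the coarser scale, which is the intuition behind using several scales. The network described in the abstract additionally models spatial layout and higher-order priors, which this sketch omits.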
