Deep Multicameral Decoding for Localizing Unoccluded Object Instances from a Single RGB Image

Oriented boundary detection is a challenging task aimed at both delineating category-agnostic object instances and inferring their spatial layout from a single RGB image. State-of-the-art deep convolutional networks for this task rely on two independent streams that predict boundaries and occlusions respectively, although both require similar local and global cues, and occlusions cause boundaries. We therefore propose a fully convolutional bicameral structuring, composed of two cascaded decoders sharing one deep encoder, linked altogether by skip connections to combine local and global features, for jointly predicting instance boundaries and their unoccluded side. Furthermore, state-of-the-art datasets contain real images with few instances and occlusions mostly due to objects occluding the background, thereby missing meaningful occlusions between instances. For evaluating the missing scenario of dense piles of objects as well, we introduce synthetic data (Mikado), which extensibly contains more instances and inter-instance occlusions per image than the PASCAL Instance Occlusion Dataset (PIOD), the COCO Amodal dataset (COCOA), and the Densely Segmented Supermarket Amodal dataset (D2SA). We show that the proposed network design outperforms the two-stream baseline and alternative archiectures for oriented boundary detection on both PIOD and Mikado, and the amodal segmentation approach on COCOA as well. Our experiments on D2SA also show that Mikado is plausible in the sense that it enables the learning of performance-enhancing representations transferable to real data, while drastically reducing the need of hand-made annotations for finetuning.

[1]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2]  Chunhua Shen,et al.  Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[4]  Nikolaos G. Tsagarakis,et al.  Detecting object affordances with Convolutional Neural Networks , 2016, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[5]  Michael J. Black,et al.  Occlusion Boundary Detection via Deep Exploration of Context , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Stella X. Yu,et al.  Affinity CNN: Learning Pixel-Centric Pairwise Relations for Figure/Ground Embedding , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Jitendra Malik,et al.  Learning to detect natural image boundaries using local brightness, color, and texture cues , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Tyler Lu,et al.  Impossibility Theorems for Domain Adaptation , 2010, AISTATS.

[9]  Nikolaos Grammalidis,et al.  Disparity and occlusion estimation in multiocular systems and their coding for the communication of multiview image sequences , 1998, IEEE Trans. Circuits Syst. Video Technol..

[10]  Yuandong Tian,et al.  Semantic Amodal Segmentation , 2015, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Alan L. Yuille,et al.  DOC: Deep OCclusion Estimation from a Single Image , 2015, ECCV.

[12]  Jianbo Shi,et al.  DeepEdge: A multi-scale bifurcated deep network for top-down contour detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Ashutosh Saxena,et al.  Learning Depth from Single Monocular Images , 2005, NIPS.

[14]  Jitendra Malik,et al.  Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.

[15]  Yuan Xie,et al.  Instance-Level Salient Object Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Luc Van Gool,et al.  Convolutional Oriented Boundaries , 2016, ECCV.

[17]  Liqing Zhang,et al.  Object proposal by multi-branch hierarchical segmentation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Luc Van Gool,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[19]  Stefano Soatto,et al.  Sparse Occlusion Detection with Optical Flow , 2012, International Journal of Computer Vision.

[20]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[21]  Xiaogang Wang,et al.  Deep Dual Learning for Semantic Image Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Martial Hebert,et al.  Local detection of occlusion boundaries in video , 2009, Image Vis. Comput..

[23]  Joseph J. Lim,et al.  Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[25]  Markus Ulrich,et al.  MVTec D2S: Densely Segmented Supermarket Dataset , 2018, ECCV.

[26]  Jianbo Shi,et al.  High-for-Low and Low-for-High: Efficient Boundary Detection from Deep Object Features and Its Applications to High-Level Vision , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[27]  Yang Zou,et al.  Simultaneous Edge Alignment and Learning , 2018, ECCV.

[28]  Yan Wang,et al.  DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[32]  Jonathan T. Barron,et al.  Multiscale Combinatorial Grouping for Image Segmentation and Object Proposal Generation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Xin Zhao,et al.  Deep Crisp Boundaries , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Yiannis Aloimonos,et al.  Fast 2D border ownership assignment , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Takeo Kanade,et al.  A Cooperative Algorithm for Stereo Matching and Occlusion Detection , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[37]  Trevor Darrell,et al.  Learning to Segment Every Thing , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[39]  Michael Isard,et al.  Estimating disparity and occlusions in stereo video sequences , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[40]  Stefano Soatto,et al.  Occlusion Detection and Motion Estimation with Convex Optimization , 2010, NIPS.

[41]  Philip H. S. Torr,et al.  Pixelwise Instance Segmentation with a Dynamically Instantiated Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Min Bai,et al.  Deep Watershed Transform for Instance Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Gabriel J. Brostow,et al.  Learning to find occlusion regions , 2011, CVPR 2011.

[45]  Alexei A. Efros,et al.  Recovering Occlusion Boundaries from a Single Image , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[46]  Honglak Lee,et al.  Object Contour Detection with a Fully Convolutional Encoder-Decoder Network , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  James L. Crowley,et al.  Symmetry Aware Evaluation of 3D Object Detection and Pose Estimation in Scenes of Many Parts in Bulk , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[48]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Michael Maire,et al.  Simultaneous Segmentation and Figure/Ground Organization Using Angular Embedding , 2010, ECCV.

[50]  Shengjun Liu,et al.  Learning to predict crisp boundaries , 2018, ECCV.

[51]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[52]  Shu Liu,et al.  Path Aggregation Network for Instance Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Alan L. Yuille,et al.  Occlusions and binocular stereo , 1992, International Journal of Computer Vision.

[55]  Yoshua Bengio,et al.  How transferable are features in deep neural networks? , 2014, NIPS.

[56]  Jitendra Malik,et al.  A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[57]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  P ? ? ? ? ? ? ? % ? ? ? ? , 1991 .

[59]  Jitendra Malik,et al.  Figure/Ground Assignment in Natural Images , 2006, ECCV.

[60]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[61]  Darwin G. Caldwell,et al.  AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[62]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[63]  Koby Crammer,et al.  A theory of learning from different domains , 2010, Machine Learning.

[64]  Ian D. Reid,et al.  Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[65]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[66]  C. Lawrence Zitnick,et al.  Structured Forests for Fast Edge Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[67]  Vincent Dumoulin,et al.  Deconvolution and Checkerboard Artifacts , 2016 .

[68]  Deqing Sun,et al.  Local Layering for Joint Motion Estimation and Occlusion Detection , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[69]  Jinhui Tang,et al.  Richer Convolutional Features for Edge Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.