Pixel Objectness: Learning to Segment Generic Objects Automatically in Images and Videos

We propose an end-to-end learning framework for segmenting generic objects in both images and videos. Given a novel image or video, our approach produces a pixel-level mask for all “object-like” regions, even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning an object/background label to each pixel, implemented using a deep fully convolutional network. When applied to a video, our model further incorporates a motion stream; the network learns to combine appearance and motion and attempts to extract all prominent objects, whether they are moving or not. Beyond the core model, a second contribution of our approach is how it leverages training annotations of varying strength. Pixel-level annotations are difficult to obtain, yet crucial for training a deep network for segmentation. We therefore propose ways to exploit weakly labeled data for learning dense foreground segmentation. For images, we show the value of mixing object category examples that carry only image-level labels with relatively few images that have boundary-level annotations. For video, we show how to bootstrap weakly annotated videos together with the network trained for image segmentation. In experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state of the art for fully automatic segmentation of generic (unseen) objects. In addition, we demonstrate how our approach benefits image retrieval and image retargeting, both of which improve when given our high-quality foreground maps. Code, models, and videos are at: http://vision.cs.utexas.edu/projects/pixelobjectness/.
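To make the per-pixel object/background formulation concrete, the following is a minimal sketch rather than the released implementation. It assumes a network has already produced two scores (background, object) for every pixel, converts them to an object probability with a two-class softmax, and thresholds the result into a binary mask. The function names, the 0.5 threshold, and the toy score values are illustrative assumptions. In particular, `fuse_streams` simply averages the appearance and motion probabilities as a stand-in; the model described above learns how to combine the two streams.

```python
import math


def object_probability(bg_score, obj_score):
    """Two-class softmax: probability that one pixel belongs to an object."""
    m = max(bg_score, obj_score)  # subtract the max for numerical stability
    e_bg = math.exp(bg_score - m)
    e_obj = math.exp(obj_score - m)
    return e_obj / (e_bg + e_obj)


def probability_map(scores):
    """Apply the softmax to every (background, object) score pair in a 2D grid."""
    return [[object_probability(bg, obj) for (bg, obj) in row] for row in scores]


def fuse_streams(appearance_prob, motion_prob):
    """Hypothetical stand-in for learned fusion: average the two probability maps."""
    return [
        [0.5 * (a + m) for a, m in zip(row_a, row_m)]
        for row_a, row_m in zip(appearance_prob, motion_prob)
    ]


def to_mask(prob, threshold=0.5):
    """Label a pixel as object (1) when its probability exceeds the threshold."""
    return [[1 if p > threshold else 0 for p in row] for row in prob]


# Toy 2x2 "frame": each entry is an assumed (background, object) score pair.
appearance_scores = [[(2.0, -1.0), (-1.0, 2.0)],
                     [(0.0, 3.0), (1.0, 0.0)]]
motion_scores = [[(1.0, 0.0), (0.0, 0.0)],
                 [(-2.0, 2.0), (3.0, -1.0)]]

image_mask = to_mask(probability_map(appearance_scores))
video_mask = to_mask(
    fuse_streams(probability_map(appearance_scores), probability_map(motion_scores))
)
print(image_mask)
print(video_mask)
```

For a still image only the appearance scores are available, so `image_mask` is computed from them alone; for a video frame the sketch fuses both maps before thresholding.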
