Unsupervised learning of supervoxel embeddings for video Segmentation

We present an algorithm for learning a feature representation for video segmentation. Standard video segmentation algorithms utilize similarity measurements in order to group related pixels. The contribution of our paper is an unsupervised method for learning the feature representation used for this similarity. The feature representation is defined over video supervoxels. An embedding framework learns a feature mapping for supervoxels in an unsupervised fashion such that supervoxels with similar context have similar embeddings. Based on the learned representation, we can merge similar supervoxels into spatio-temporal segments. Experimental results demonstrate the effectiveness of this learned supervoxel embedding on standard benchmark data.

[1]  Kurt Keutzer,et al.  Long term video segmentation through pixel level spectral clustering on GPUs , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[2]  Mubarak Shah,et al.  Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[4]  Thomas Brox,et al.  A Unified Video Segmentation Benchmark: Annotation, Metrics and Analysis , 2013, 2013 IEEE International Conference on Computer Vision.

[5]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Thomas Brox,et al.  Spectral Graph Reduction for Efficient Image and Streaming Video Segmentation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  John W. Fisher,et al.  A Video Representation Using Temporal Superpixels , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Bernt Schiele,et al.  Classifier based graph construction for video segmentation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Ran Xu,et al.  Human action segmentation with hierarchical supervoxel consistency , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Edward H. Adelson,et al.  A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models , 1996, Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  Chenliang Xu,et al.  Streaming Hierarchical Video Segmentation , 2012, ECCV.

[12]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Bernt Schiele,et al.  Video Segmentation with Superpixels , 2012, ACCV.

[14]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Ignas Budvytis,et al.  Mixture of Trees Probabilistic Graphical Model for Video Segmentation , 2013, International Journal of Computer Vision.

[16]  P. Anandan,et al.  A unified approach to moving object detection in 2D and 3D scenes , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[17]  Jitendra Malik,et al.  Motion segmentation and tracking using normalized cuts , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[18]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[19]  Fei-Fei Li,et al.  Learning Temporal Embeddings for Complex Video Analysis , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.