论文信息 - Clockwork Convnets for Video Semantic Segmentation

Clockwork Convnets for Video Semantic Segmentation

Recent years have seen tremendous progress in still-image segmentation; however the naive application of these state-of-the-art algorithms to every video frame requires considerable computation and ignores the temporal continuity inherent in video. We propose a video recognition framework that relies on two key observations: 1) while pixels may change rapidly from frame to frame, the semantic content of a scene evolves more slowly, and 2) execution can be viewed as an aspect of architecture, yielding purpose-fit computation schedules for networks. We define a novel family of "clockwork" convnets driven by fixed or adaptive clock signals that schedule the processing of different layers at different update rates according to their semantic stability. We design a pipeline schedule to reduce latency for real-time recognition and a fixed-rate schedule to reduce overall computation. Finally, we extend clockwork scheduling to adaptive video processing by incorporating data-driven clocks that can be tuned on unlabeled video. The accuracy and efficiency of clockwork convnets are evaluated on the Youtube-Objects, NYUD, and Cityscapes video datasets.

[1] Ming Yang,et al. 3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] Rob Fergus,et al. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[3] Mei Han,et al. Efficient hierarchical graph-based video segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[4] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[5] Joan Bruna,et al. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation , 2014, NIPS.

[6] Le Song,et al. Deep Fried Convnets , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[7] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Thomas Brox,et al. FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[9] Fei-Fei Li,et al. Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10] Terrence J. Sejnowski,et al. Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[11] Li Fei-Fei,et al. End-to-End Learning of Action Detection from Frame Glimpses in Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[13] Vittorio Ferrari,et al. Fast Object Segmentation in Unconstrained Video , 2013, 2013 IEEE International Conference on Computer Vision.

[14] Jian Sun,et al. Convolutional neural networks at constrained time cost , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Ivan Laptev,et al. On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[16] Andrew Zisserman,et al. Speeding up Convolutional Neural Networks with Low Rank Expansions , 2014, BMVC.

[17] Luc Van Gool,et al. The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[18] James M. Rehg,et al. Weakly Supervised Learning of Object Segmentations from Web-Scale Video , 2012, ECCV Workshops.

[19] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Jitendra Malik,et al. Learning to segment moving objects in videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[22] Fei-Fei Li,et al. Discriminative Segment Annotation in Weakly Labeled Video , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[23] Jitendra Malik,et al. Motion segmentation and tracking using normalized cuts , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[24] Cordelia Schmid,et al. Learning object class detectors from weakly annotated video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[25] Vincent Vanhoucke,et al. Improving the speed of neural networks on CPUs , 2011 .

[26] Derek Hoiem,et al. Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[27] Chenliang Xu,et al. Evaluation of super-voxel methods for early video processing , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[28] Jürgen Schmidhuber,et al. A Clockwork RNN , 2014, ICML.

[29] Sebastian Ramos,et al. The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30] J. MUYBRIDGE,et al. The Horse in Motion , 1882, Nature.

[31] Kristen Grauman,et al. Supervoxel-Consistent Foreground Propagation in Video , 2014, ECCV.

[32] Vladlen Koltun,et al. Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[33] Xiao Liu,et al. Weakly Supervised Multiclass Video Segmentation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.