Temporally enhanced image object proposals for videos

While image object proposals (IOPs) have recently proven successful for image applications, per-frame IOPs are also important for video applications. However, existing IOPs are extracted from each frame separately and may be inconsistent across frames. In this paper, we improve existing IOPs by enforcing temporal consistency throughout a video sequence in an on-line manner. To achieve this, we propose a novel spatio-temporal objectness measure that considers both frame-level objectness and temporal consistency across frames. We also propose an on-line dynamic programming technique to compute this spatio-temporal objectness efficiently. In addition, compared with spatio-temporal video object proposals (VOPs), the proposed method supports on-line applications and provides more accurate per-frame localization. Experiments on benchmark datasets validate its superior performance over existing IOPs and VOPs.
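
As a rough illustration of what an on-line dynamic programming update of this kind could look like, the Python sketch below rescores each frame's proposals using only the accumulated scores of the previous frame. The abstract does not state the actual measure, so the additive form, the IoU-based consistency term, and the names `iou`, `online_rescore`, and `weight` are all assumptions made for illustration, not the paper's formulation.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def online_rescore(
    prev_boxes: Sequence[Box],
    prev_scores: Sequence[float],
    boxes: Sequence[Box],
    scores: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Assumed update: frame score plus a weighted bonus from the best
    overlapping proposal of the previous frame (its accumulated score
    times the IoU with the current box)."""
    if not prev_boxes:
        # First frame: nothing to be consistent with yet.
        return list(scores)
    rescored = []
    for box, score in zip(boxes, scores):
        best_prev = max(
            ps * iou(pb, box) for pb, ps in zip(prev_boxes, prev_scores)
        )
        rescored.append(score + weight * best_prev)
    return rescored


# Usage: process frames one at a time, keeping only the previous frame's state.
frame1_boxes = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
frame1_scores = online_rescore([], [], frame1_boxes, [0.5, 0.5])

frame2_boxes = [(0.0, 0.0, 10.0, 10.0), (50.0, 50.0, 60.0, 60.0)]
frame2_scores = online_rescore(frame1_boxes, frame1_scores, frame2_boxes, [0.4, 0.5])
```

In this toy run the first proposal in the second frame starts with a lower frame-level score than the second, but it overlaps a proposal from the first frame and so ends up ranked higher. Because each update reads only the previous frame's boxes and accumulated scores, memory stays constant in the video length, which is the property an on-line method needs.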
