Common Action Discovery and Localization in Unconstrained Videos

Similar to common object discovery in images or videos, it is of great interest to discover and locate common actions in videos, which can benefit many video analytics applications such as video summarization, search, and understanding. In this work, we tackle the problem of common action discovery and localization in unconstrained videos, where we do not assume knowledge of the types, numbers, or locations of the common actions in the videos. Furthermore, each video can contain zero, one, or several common action instances. To perform automatic discovery and localization in such challenging scenarios, we first generate action proposals using a human prior. By building an affinity graph among all action proposals, we formulate common action discovery as a subgraph density maximization problem to select the proposals containing common actions. To avoid enumerating the exponentially large solution space, we propose an efficient polynomial-time optimization algorithm that solves the problem up to a user-specified error bound with respect to the globally optimal solution. Experimental results on several datasets show that, even without any prior knowledge of the common actions, our method can robustly locate the common actions in a collection of videos.
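To make the subgraph density formulation concrete, the following is a minimal sketch of selecting a dense subset of proposals from an affinity matrix. It is not the optimization algorithm proposed in this work: it uses the well-known greedy peeling heuristic for the densest subgraph problem, which has no user-specified error bound. The density definition (sum of pairwise affinities inside the subset divided by the subset size), the function name `densest_subgraph_peeling`, and the toy affinity values are all assumptions made for illustration.

```python
def densest_subgraph_peeling(affinity):
    """Greedy peeling on a symmetric affinity matrix (list of lists).

    Repeatedly removes the node with the smallest weighted degree and
    remembers the densest intermediate subgraph, where density is assumed
    to be (sum of edge weights inside the subset) / (number of nodes).
    Returns (sorted node indices, density).
    """
    n = len(affinity)
    if n == 0:
        return [], 0.0
    alive = set(range(n))
    # Weighted degree of each node within the current subgraph.
    degree = [sum(affinity[i][j] for j in range(n) if j != i) for i in range(n)]
    total = sum(degree) / 2.0  # each edge is counted twice in the degrees
    best_nodes, best_density = set(alive), total / n
    while len(alive) > 1:
        # Peel off the node that is least connected to the rest.
        v = min(alive, key=lambda i: degree[i])
        alive.remove(v)
        total -= degree[v]
        for u in alive:
            degree[u] -= affinity[v][u]
        density = total / len(alive)
        if density > best_density:
            best_nodes, best_density = set(alive), density
    return sorted(best_nodes), best_density


# Hypothetical toy data: proposals 0-2 are mutually similar (affinity 0.9),
# proposals 3-4 are weakly related to everything (affinity 0.1).
toy = [[0.0 if i == j else (0.9 if i < 3 and j < 3 else 0.1)
        for j in range(5)] for i in range(5)]
nodes, density = densest_subgraph_peeling(toy)
print(nodes, round(density, 3))
```

On this toy input the three mutually similar proposals are selected as the common group, while the two weakly connected proposals are peeled away.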
