Paper Information - Action Recognition with Actons

With ever more video data accessible and growing demand across a wide range of video analysis applications, video-based action recognition (classification) has become an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition that automatically exploits a mid-level "acton" representation. The actons are learned via a new max-margin multi-channel multiple instance learning framework. The learned actons, which require no detailed manual annotations, are thus compact, informative, discriminative, and easy to scale. This differs from the standard unsupervised (e.g., k-means) or supervised (e.g., random forests) coding strategies in action recognition. Applying the learned actons in our two-layer structure yields state-of-the-art classification performance on the YouTube and HMDB51 datasets.
