Action Recognition with Actons

With ever-easier access to a rapidly growing amount of video data, and with rising demand across a wide range of video analysis applications, video-based action recognition and classification has become an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition that automatically exploits a mid-level "acton" representation. The actons are learned via a new max-margin multi-channel multiple-instance learning framework, which requires no detailed manual annotations. The learned actons are compact, informative, discriminative, and easy to scale. This differs from the standard unsupervised (e.g., k-means) and supervised (e.g., random forests) coding strategies used in action recognition. Applying the learned actons in our two-layer structure yields state-of-the-art classification performance on the YouTube and HMDB51 datasets.
