Zero-shot Action Recognition via Empirical Maximum Mean Discrepancy

In recent years, with the rapid development of video capture devices (e.g., Kinect) and online video providers (e.g., YouTube, Youku), the volume of video data, the number of action categories, and the complexity of video content have grown explosively. Manually annotating such a huge number of videos for traditional supervised human action recognition is therefore expensive and often intractable. Motivated by this situation, researchers seek to predict the label of an unseen video using a disjoint set of training videos, a setting referred to as zero-shot learning for human action recognition. In this paper, we propose a novel zero-shot action recognition method based on a linear regression model that learns a visual-to-semantic mapping to project unseen videos into an appropriate semantic representation. On the one hand, we exploit both training and testing videos and simultaneously preserve their structural information. On the other hand, we introduce an empirical maximum mean discrepancy (MMD) regularization term into the objective function, which reduces the gap between the learned semantic representations and the known action prototypes, thereby relieving the domain shift problem. Experiments on the HMDB51 dataset demonstrate the effectiveness of our method.
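
To make the two main ingredients concrete, the following is a minimal Python sketch, assuming a ridge-style regression objective for the visual-to-semantic mapping and a linear-kernel empirical MMD between projected test videos and the unseen-class prototypes. All names, dimensions, and hyperparameters are illustrative; this is not the paper's exact objective or optimization procedure.

    import numpy as np

    def empirical_mmd(A, B):
        # Empirical MMD with a linear kernel: squared distance between
        # the means of the two sample sets (rows are samples).
        return np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

    def learn_mapping(X, S, lam=1.0):
        # Closed-form ridge regression W mapping visual features X (n x d)
        # to semantic representations S (n x k):
        #   min_W ||X W - S||^2 + lam ||W||^2
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ S)

    # Toy example: 5 training clips with 64-dim visual features mapped to
    # 16-dim semantic vectors; 10 hypothetical unseen-class prototypes.
    rng = np.random.default_rng(0)
    X_train = rng.standard_normal((5, 64))
    S_train = rng.standard_normal((5, 16))
    X_test = rng.standard_normal((3, 64))
    prototypes = rng.standard_normal((10, 16))   # one semantic vector per unseen class

    W = learn_mapping(X_train, S_train, lam=0.1)
    S_pred = X_test @ W                          # projected test semantics

    # The MMD term would penalize this quantity during training, pulling the
    # projected test videos toward the prototype distribution.
    print("MMD(projections, prototypes):", empirical_mmd(S_pred, prototypes))

    # Zero-shot prediction by nearest (most similar) prototype.
    labels = np.argmax(S_pred @ prototypes.T, axis=1)
    print("predicted class indices:", labels)

In this sketch the MMD value is only reported; in the method described above it would be added to the regression loss as a regularizer so that the mapping itself is steered toward the prototype distribution.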
