Aligned Dynamic-Preserving Embedding for Zero-Shot Action Recognition

Zero-shot learning (ZSL) typically explores a shared semantic space to recognize novel categories without any labeled training data. However, traditional ZSL methods often suffer from a serious domain-shift problem in human action recognition, for three reasons: 1) existing ZSL methods are designed for object recognition in static images and do not capture the temporal dynamics of video sequences, so they tend to perform poorly when applied directly to zero-shot action recognition; 2) these methods blindly project the target data into a shared space using a semantic mapping learned from the source data without any adaptation, ignoring the underlying structure of the target data; and 3) severe inter-class variations exist among action categories, yet traditional ZSL methods do not take the relationships across categories into account. In this paper, we propose a novel aligned dynamic-preserving embedding (ADPE) model for zero-shot action recognition in a transductive setting. Our model learns an adaptive embedding of target videos that exploits the distributions of both the source and target data. An aligned regularization is further proposed to couple the centers of the target semantic representations with their corresponding label prototypes, preserving the relationships across categories. Most significantly, the embedding simultaneously preserves the temporal dynamics of video sequences by exploiting their temporal consistency and capturing the temporal evolution of successive action segments. Our model thus effectively mitigates the domain-shift problem in zero-shot action recognition. Experiments on the Olympic Sports, HMDB51, and UCF101 datasets demonstrate the effectiveness of our model.
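The aligned-regularization idea, coupling the center of each class's target semantic representations with that class's label prototype, can be sketched as a simple penalty term. The following is a minimal illustrative sketch only, not the paper's actual objective; the function name, the hard class assignments, and the use of a plain squared Euclidean distance are all assumptions for illustration.

```python
import numpy as np

def aligned_regularizer(embeddings, assignments, prototypes):
    """Illustrative sketch (not the paper's exact formulation):
    penalize the distance between the center of the target embeddings
    currently assigned to each class and that class's label prototype.

    embeddings  : (n, d) semantic representations of target videos
    assignments : (n,)   predicted class index for each target video
    prototypes  : (C, d) label prototypes (e.g. word vectors of class names)
    """
    loss = 0.0
    for c in range(prototypes.shape[0]):
        members = embeddings[assignments == c]
        if members.shape[0] == 0:
            continue  # no target video currently assigned to class c
        center = members.mean(axis=0)          # class center in semantic space
        loss += np.sum((center - prototypes[c]) ** 2)
    return loss

# Toy example: two classes in a 3-D semantic space; embeddings are
# noisy copies of the prototypes, so the penalty should be small.
rng = np.random.default_rng(0)
protos = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
emb = protos[[0, 0, 1, 1]] + 0.1 * rng.standard_normal((4, 3))
assign = np.array([0, 0, 1, 1])
print(aligned_regularizer(emb, assign, protos))
```

In a transductive setting such as the one considered here, a term of this kind would be minimized jointly with the embedding itself, pulling the (unlabeled) target class centers toward their prototypes rather than leaving the projection fixed by the source data alone.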