Robust relative attributes for human action recognition

High-level semantic features are important for recognizing human actions. Relative attributes, which describe relative relationships, have recently been proposed as one such feature and have shown promising performance. However, their training process is very sensitive to noise, and it is not robust in zero-shot learning. To overcome these drawbacks, we propose a robust learning framework that uses relative attributes for human action recognition. We add Sigmoid and Gaussian envelopes to the loss objective simultaneously, which greatly reduces the influence of outliers during optimization and thus improves accuracy. In addition, we adopt Gaussian mixture models to better fit the distribution of actions in the rank score space, and we propose a novel transfer strategy to estimate the parameters of the Gaussian mixture models for unseen classes. Our method is evaluated on three challenging datasets (KTH, UIUC and HOLLYWOOD2), and the experimental results show that it outperforms previous methods in both zero-shot classification and traditional human action recognition.
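
As a rough illustration of why a bounded envelope can reduce the influence of outliers, the sketch below compares an unbounded hinge penalty on a pairwise ranking margin with a sigmoid-shaped penalty. This is not the objective used in the paper, which the abstract does not spell out; the function names, the slope parameter `beta`, and the toy margins are all assumptions made for illustration.

```python
import math

def hinge_penalty(margin):
    """Unbounded penalty: grows linearly as the ranking margin is violated."""
    return max(0.0, 1.0 - margin)

def sigmoid_penalty(margin, beta=1.0):
    """Bounded penalty in (0, 1): saturates for badly violated margins.

    beta is a hypothetical slope parameter, not taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(beta * margin))

# Hypothetical margins for ordered training pairs; the last one is an outlier
# (for example, a mislabeled pair whose ordering is strongly violated).
margins = [2.0, 0.5, -0.5, -20.0]

hinge = [hinge_penalty(m) for m in margins]
sig = [sigmoid_penalty(m) for m in margins]

# Fraction of the total loss contributed by the outlier under each penalty.
hinge_share = hinge[-1] / sum(hinge)
sig_share = sig[-1] / sum(sig)

print("outlier share of hinge loss:  ", round(hinge_share, 3))
print("outlier share of sigmoid loss:", round(sig_share, 3))
```

Under the hinge penalty the single outlier dominates the total loss, whereas the saturating penalty caps its contribution below 1, so the remaining pairs keep a meaningful say in the optimization.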
