Extract the Relational Information of Static Features and Motion Features for Human Activities Recognition in Videos

Both static features and motion features have shown promising performance in human activities recognition task. However, the information included in these features is insufficient for complex human activities. In this paper, we propose extracting relational information of static features and motion features for human activities recognition. The videos are represented by a classical Bag-of-Word (BoW) model which is useful in many works. To get a compact and discriminative codebook with small dimension, we employ the divisive algorithm based on KL-divergence to reconstruct the codebook. After that, to further capture strong relational information, we construct a bipartite graph to model the relationship between words of different feature set. Then we use a k-way partition to create a new codebook in which similar words are getting together. With this new codebook, videos can be represented by a new BoW vector with strong relational information. Moreover, we propose a method to compute new clusters from the divisive algorithm's projective function. We test our work on the several datasets and obtain very promising results.

[1]  Chunheng Wang,et al.  Action recognition via structured codebook construction , 2014, Signal Process. Image Commun..

[2]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Tony Lindeberg,et al.  Scale Invariant Feature Transform , 2012, Scholarpedia.

[4]  S DhillonInderjit,et al.  A divisive information theoretic feature clustering algorithm for text classification , 2003 .

[5]  Inderjit S. Dhillon,et al.  A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification , 2003, J. Mach. Learn. Res..

[6]  Andrew McCallum,et al.  Distributional clustering of words for text classification , 1998, SIGIR '98.

[7]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Naftali Tishby,et al.  Distributional Clustering of English Words , 1993, ACL.

[9]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[10]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[11]  Stefano Soatto,et al.  Localizing Objects with Smart Dictionaries , 2008, ECCV.

[12]  Alexander G. Hauptmann,et al.  MoSIFT: Recognizing Human Actions in Surveillance Videos , 2009 .

[13]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[14]  Dong Liu,et al.  Joint audio-visual bi-modal codewords for video event detection , 2012, ICMR.

[15]  Silvio Savarese,et al.  Cross-view action recognition via view knowledge transfer , 2011, CVPR 2011.

[16]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.