Discriminative key-segment model for interaction detection

Automatic activity detection in video has many applications, including visual surveillance, video retrieval, and human-computer interaction. At its core, the task requires expressive models of activities. Models that represent an activity as an arrangement of key components are generally more descriptive and more robust to challenges such as occlusion, clutter, and high intra-class variability, and can therefore yield better classification performance. Following this idea, we model human-object interactions as sequences of locally discriminative temporal segments that capture object appearance and inter-object relations. In a two-stage pipeline, we first coarsely localize humans and objects in long videos, then examine their content more closely with our key-segment model, trained in the latent SVM framework. We evaluate the approach on the VIRAT Ground Dataset Release 2.0 for detecting instances of human-vehicle interactions. Results show that our key-segment model significantly outperforms the common global bag-of-words approach.
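The abstract's key-segment idea can be illustrated with a minimal sketch: score a video as the best ordered, non-overlapping placement of K temporal segments, each matched against a learned linear template, and take the placement (the latent variable) that maximizes the total score, as latent-SVM models do at inference time. This is not the paper's exact formulation; the function name, the mean-pooled segment descriptor, and the fixed segment length are illustrative assumptions.

```python
import numpy as np

def key_segment_score(frame_feats, seg_weights, seg_len):
    """Illustrative latent key-segment scoring (not the paper's exact model).

    frame_feats: (T, D) per-frame feature vectors
    seg_weights: (K, D) one linear template per key segment
    seg_len:     number of frames pooled into one segment (assumed fixed)
    Returns the max total score over ordered, non-overlapping segment
    placements -- the latent assignment -- found by dynamic programming.
    """
    T, D = frame_feats.shape
    K = seg_weights.shape[0]
    starts = T - seg_len + 1
    # Segment descriptor for a start frame t: mean of the next seg_len frames.
    pooled = np.stack([frame_feats[t:t + seg_len].mean(axis=0)
                       for t in range(starts)])   # (starts, D)
    unary = pooled @ seg_weights.T                # (starts, K) template scores
    NEG = -np.inf
    # dp[k, t] = best score with segments 0..k placed, segment k starting at t
    dp = np.full((K, starts), NEG)
    dp[0] = unary[:, 0]
    for k in range(1, K):
        best_prev = NEG                 # running max of dp[k-1, 0..t-seg_len]
        for t in range(starts):
            prev_t = t - seg_len        # previous segment must end before t
            if prev_t >= 0:
                best_prev = max(best_prev, dp[k - 1, prev_t])
            if best_prev > NEG:
                dp[k, t] = best_prev + unary[t, k]
    return dp[K - 1].max()
```

With K = 2 unit-length segments and scalar features [1, 2, 3, 4] scored by identity templates, the best ordered placement picks frames 2 and 3, giving 3 + 4 = 7. In training, a latent SVM would alternate between this inference step (finding the best placement) and updating `seg_weights` with a max-margin objective.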
