Spatio-temporal Context Modeling for BoW-Based Video Classification

We propose an autocorrelation Cox process that extends the traditional bag-of-words representation to model the spatio-temporal context within a video sequence. Bag-of-words models are an effective tool for representing a video by a histogram of visual words that describe local appearance and motion. A major limitation of this representation is its inability to encode the spatio-temporal structure of the visual words, i.e., the context of the video. Several works have proposed to remedy this by learning pairwise correlations between words. However, pairwise analysis leads to a quadratic increase in the number of features, making the models prone to overfitting and difficult to learn from data. The proposed autocorrelation Cox process model compactly encodes the contextual information within a video sequence, leading to improved classification performance. Spatio-temporal autocorrelations of visual words estimated from the Cox process are coupled with information-gain feature selection to discern the structure essential for the classification task. Experiments on crowd activity and human action datasets show that the proposed model achieves state-of-the-art performance while providing intuitive spatio-temporal descriptors of the video context.
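The abstract's central argument can be made concrete. With V visual words, pairwise co-occurrence modeling yields on the order of V^2 features, whereas per-word autocorrelation descriptors grow only linearly in V: each word contributes its bag-of-words count plus autocorrelation values at a small fixed set of spatio-temporal lags. The Cox process machinery makes this second-order structure tractable; in particular, for a log-Gaussian Cox process (Møller et al., 1998) with random intensity Lambda(x) = exp{Z(x)}, where Z is a Gaussian field with covariance C, the pair correlation function is g(x, x') = exp{C(x, x')}. The sketch below is a minimal illustration of such a pipeline, not the paper's implementation: the coarse occurrence grids, the lag set, the circular empirical autocorrelation estimator (used in place of a model-based Cox process estimate), and scikit-learn's mutual_info_classif as a stand-in for information-gain selection are all assumptions made for illustration.

    # Hypothetical sketch: per-word spatio-temporal autocorrelation descriptors
    # appended to a BoW histogram, followed by information-gain-style feature
    # selection. Shapes, lags, and estimators are illustrative assumptions.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    def word_autocorrelation(occ, lags):
        # occ  : (T, H, W) binary volume of one word's occurrences on a grid.
        # lags : iterable of (dt, dy, dx) spatio-temporal offsets.
        occ = occ - occ.mean()
        denom = (occ ** 2).sum() + 1e-8
        feats = []
        for dt, dy, dx in lags:
            # np.roll wraps around the volume, i.e. a circular autocorrelation;
            # a simplification compared to a proper boundary-corrected estimate.
            shifted = np.roll(occ, shift=(dt, dy, dx), axis=(0, 1, 2))
            feats.append((occ * shifted).sum() / denom)
        return np.asarray(feats)

    def video_features(volumes, lags):
        # volumes : (V, T, H, W) occurrence volumes, one per visual word.
        hist = volumes.sum(axis=(1, 2, 3))           # plain bag-of-words counts
        hist = hist / (hist.sum() + 1e-8)            # normalized histogram
        auto = np.concatenate([word_autocorrelation(v, lags) for v in volumes])
        return np.concatenate([hist, auto])          # linear, not quadratic, in V

    rng = np.random.default_rng(0)
    lags = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]   # assumed lag set
    X = np.stack([video_features(rng.random((50, 4, 8, 8)) < 0.1, lags)
                  for _ in range(40)])                    # synthetic videos
    y = rng.integers(0, 2, size=40)                       # synthetic labels

    # Keep the most discriminative features by mutual information, a common
    # proxy for the information-gain criterion named in the abstract.
    selector = SelectKBest(mutual_info_classif, k=20).fit(X, y)
    X_sel = selector.transform(X)
    print(X.shape, "->", X_sel.shape)

With V = 50 words and 4 lags this produces 250 features per video rather than the roughly 2,500 a pairwise model would require, which is the overfitting argument the abstract makes in compressed form.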
