Monte Carlo Tree Search for Scheduling Activity Recognition

This paper addresses the recognition of human activities with stochastic structure, which are characterized by variable space-time arrangements of primitive actions and are conducted by a variable number of actors. Our approach classifies the activity of interest and identifies the relevant foreground in the video. Each activity is represented as a linear mixture of many bags-of-words (BoWs), where each BoW represents an important foreground part of the activity, and the mixture is captured by a Sum-Product Network (SPN). The mixture distribution is computed efficiently by organizing the BoWs in a hierarchy in which children BoWs are nested within parent BoWs. An SPN can model this mixture because it consists of terminal nodes representing BoWs, together with product nodes and sum nodes, organized in a number of layers. The product nodes encode particular configurations of primitive actions, and the sum nodes capture alternative configurations. SPN inference amounts to parsing the SPN graph, which yields the most probable explanation (MPE) of the video foreground. Under fairly general conditions, SPN inference has linear complexity in the number of nodes, enabling fast and scalable recognition. The connectivity of the SPN and the parameters of the BoW distributions are learned under weak supervision using a variational EM algorithm. For evaluation, we have compiled and annotated a new Volleyball dataset. Our classification accuracy and localization results are superior to those of the state of the art on current benchmarks as well as on our Volleyball dataset.
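The roles of sum nodes, product nodes, and MPE parsing described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the leaf likelihoods, the mixture weights, and the volleyball action labels are all hypothetical values chosen for illustration. The toy network is a tree, so each node is visited exactly once per pass; a network with shared children would need memoization to keep the cost linear in the number of nodes.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    name: str          # hypothetical label of a primitive action
    likelihood: float  # assumed BoW likelihood for one video part


@dataclass
class Product:
    children: List["Node"]  # one particular configuration of parts


@dataclass
class Sum:
    children: List["Node"]  # alternative configurations
    weights: List[float]    # mixture weights, assumed to sum to 1


Node = Union[Leaf, Product, Sum]


def value(node: Node) -> float:
    """Bottom-up evaluation: weighted sums at Sum nodes, products at Product nodes."""
    if isinstance(node, Leaf):
        return node.likelihood
    if isinstance(node, Product):
        result = 1.0
        for child in node.children:
            result *= value(child)
        return result
    return sum(w * value(c) for w, c in zip(node.weights, node.children))


def mpe(node: Node) -> Tuple[float, List[str]]:
    """MPE parse: like value(), but each Sum node keeps only its best weighted child."""
    if isinstance(node, Leaf):
        return node.likelihood, [node.name]
    if isinstance(node, Product):
        score, selected = 1.0, []
        for child in node.children:
            child_score, child_leaves = mpe(child)
            score *= child_score
            selected.extend(child_leaves)
        return score, selected
    best_score, best_leaves = float("-inf"), []
    for w, child in zip(node.weights, node.children):
        child_score, child_leaves = mpe(child)
        if w * child_score > best_score:
            best_score, best_leaves = w * child_score, child_leaves
    return best_score, best_leaves


# Toy network: two alternative configurations of primitive actions.
config_a = Product([Leaf("jump", 0.8), Leaf("spike", 0.5)])
config_b = Product([Leaf("set", 0.6), Leaf("block", 0.2)])
root = Sum([config_a, config_b], weights=[0.3, 0.7])

total = value(root)
best_score, foreground = mpe(root)
print(total, best_score, foreground)
```

In this toy case the parse selects the first configuration, so the leaves it returns play the role of the foreground parts that explain the video, while `value(root)` gives the full mixture likelihood that would be compared across activity classes.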
