Cost-Sensitive Top-Down/Bottom-Up Inference for Multiscale Activity Recognition

This paper addresses a new problem: multiscale activity recognition. Our goal is to detect and localize a wide range of activities, including individual actions and group activities, which may simultaneously co-occur in high-resolution video. The video resolution allows for digital zoom-in (or zoom-out) to examine fine details (or coarser scales), as needed for recognition. The key challenge is how to avoid running a multitude of detectors at all spatiotemporal scales and yet arrive at a holistically consistent video interpretation. To this end, we use a three-layered AND-OR graph to jointly model group activities, individual actions, and participating objects. The AND-OR graph allows a principled formulation of efficient, cost-sensitive inference via an explore-exploit strategy. Our inference optimally schedules the following computational processes: 1) direct application of activity detectors, called the α process; 2) bottom-up inference based on detecting activity parts, called the β process; and 3) top-down inference based on detecting activity context, called the γ process. The scheduling iteratively maximizes the log-posteriors of the resulting parse graphs. For evaluation, we have compiled and benchmarked a new dataset of high-resolution videos of group and individual activities co-occurring in a courtyard of the UCLA campus.
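
To make the scheduling idea concrete, below is a minimal sketch (in Python) of greedy, cost-sensitive scheduling: candidate α, β, and γ moves on the current parse graph are ranked by expected gain in log-posterior per unit of computational cost, and the best affordable move is applied first. The `Move` and `schedule` names, the greedy heap, and all gain/cost numbers are illustrative assumptions, not the paper's actual formulation.

```python
# Sketch of cost-sensitive scheduling over alpha/beta/gamma inference
# processes. All names and the gain/cost numbers are hypothetical.
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass(order=True)
class Move:
    # Negative utility so heapq (a min-heap) pops the best move first.
    neg_utility: float
    name: str = field(compare=False)
    cost: float = field(compare=False)
    apply: Callable[[], float] = field(compare=False)  # returns realized log-posterior gain

def schedule(moves: List[Tuple[str, float, float, Callable[[], float]]],
             budget: float) -> float:
    """Greedily run moves in order of expected gain per unit cost.

    moves:  (name, expected_gain, cost, apply_fn) tuples, one per candidate
            alpha/beta/gamma step on the current parse graph.
    budget: total computation allowed.
    Returns the accumulated log-posterior improvement.
    """
    heap = [Move(-(gain / cost), name, cost, fn) for name, gain, cost, fn in moves]
    heapq.heapify(heap)
    total_gain, spent = 0.0, 0.0
    while heap:
        move = heapq.heappop(heap)
        if spent + move.cost > budget:
            continue  # this move no longer fits; cheaper ones may still
        spent += move.cost
        total_gain += move.apply()  # update the parse graph, collect gain
    return total_gain

# Hypothetical usage: three candidate processes on one video region.
if __name__ == "__main__":
    candidates = [
        ("alpha: run activity detector directly", 2.0, 4.0, lambda: 1.8),
        ("beta: detect parts, infer bottom-up",   1.5, 2.0, lambda: 1.2),
        ("gamma: detect context, infer top-down", 1.0, 1.0, lambda: 0.9),
    ]
    print(schedule(candidates, budget=5.0))
```

A fuller implementation would re-score the remaining moves after each application, since applying one process changes the parse graph and hence the expected gains of the others; the sketch collapses that iteration into a single greedy pass for brevity.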
