SAVE: A framework for semantic annotation of visual events

In this paper we propose a framework that performs automatic semantic annotation of visual events (SAVE). This is an enabling technology for content-based video annotation, query, and retrieval, with applications in Internet video search and video data mining. The method involves identifying objects in the scene, describing their inter-relations, detecting events of interest, and representing them semantically in a human-readable and queryable format. The SAVE framework is composed of three main components. The first is an image parsing engine that performs scene content extraction using bottom-up image analysis and a stochastic attribute image grammar; here we define a visual vocabulary spanning pixels, primitives, parts, objects, and scenes, specify their spatio-temporal and compositional relations, and use a combined bottom-up and top-down strategy for inference. The second is an event inference engine, where the video event markup language (VEML) is adopted for semantic representation and a grammar-based approach is used for event analysis and detection. The third is a text generation engine that produces text reports using head-driven phrase structure grammar (HPSG). The main contribution of this paper is a framework for an end-to-end system that infers visual events and annotates a large collection of videos. Experiments with maritime and urban scenes demonstrate the feasibility of the proposed approach.
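To make the grammar-based event detection concrete, the following is a minimal illustrative sketch, not the paper's actual SAVE implementation: it models a composite event as a production rule with a probability over an ordered sequence of primitive events, and recognizes it when that sequence occurs contiguously in a detected primitive-event stream. All names here (`Rule`, `detect_events`, the example events) are hypothetical, and the contiguous-subsequence matcher is a simplification of full stochastic grammar parsing.

```python
# Toy stochastic event grammar (illustrative only; hypothetical names).
from dataclasses import dataclass

@dataclass
class Rule:
    head: str      # composite event label
    body: tuple    # ordered primitive events composing the event
    prob: float    # rule probability

def detect_events(primitives, rules):
    """Report each rule whose body occurs as a contiguous
    subsequence of the primitive-event stream."""
    detections = []
    for rule in rules:
        n = len(rule.body)
        for i in range(len(primitives) - n + 1):
            if tuple(primitives[i:i + n]) == rule.body:
                detections.append((rule.head, i, i + n, rule.prob))
    return detections

# Example grammar for maritime/urban-style composite events.
rules = [
    Rule("vehicle_stop_and_go", ("approach", "stop", "depart"), 0.8),
    Rule("boat_docking", ("approach", "slow", "stop"), 0.7),
]

stream = ["approach", "slow", "stop", "depart"]
print(detect_events(stream, rules))
# → [('boat_docking', 0, 3, 0.7)]
```

A full system would instead parse the stream with a stochastic context-free parser (e.g., an Earley-style chart parser), allowing nested composite events and competing interpretations ranked by probability; this sketch only illustrates the rule-as-event-schema idea.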
