The Mayachitra Inc. team submitted runs for the TRECVID 2010 Multimedia Event Detection (MED) pilot task evaluation. In this paper, we describe the preliminary set of results. The focus of this experiment for the Mayachitra Inc. team was to implement an end-to-end pilot system for multimedia event detection that (i) processes video, extracts, and stores state-of-the-art video descriptors, (ii) learns complex event models, and (iii) evaluates them on the test set in an efficient and effective manner. In this preliminary report, we summarize our findings on the performance of one of the important system components: the state-of-the-art activity detection approach.

We submitted two runs to NIST:

• c raw 1: max-type fusion of the scores from binary detectors trained on the subset of visual words.
• p base 1: weighted fusion of the individual scores from the activity detector.

and evaluated an additional run:

• c sel 1: cross-validation fusion of the activity detectors trained on the expanded set.

A minimal sketch of the two submitted fusion schemes is given at the end of Section 1. The performance of the runs varied significantly based on the training selection, and diversifying the training set improves the detection scores. Overall, the activity recognition component has shown clear potential within the overall event detection system for user-generated video collections. We will present a detailed analysis in the final notebook paper.

1. ACTION DESCRIPTORS

Following the explosion of user-created video content, and the lack of tools to efficiently index and retrieve it, the research community has made significant progress in advancing the use of static descriptors (i.e., visual descriptors extracted from video keyframes) to detect objects and scenes in automatic annotation pipelines, and to connect them to the events they describe [1, 2]. To describe a complete event, descriptors need to capture the scene, the objects and their relations, and the actual activity/action. The research effort of incorporating activity recognition analysis into scalable video analysis systems is still in its infancy. Lately, the computer vision community has reported favorable results in the action recognition domain as it extended traditional object recognition approaches to the spatio-temporal domain of video data [3, 4]. Actions are captured as spatio-temporal patterns in the local descriptor space.

To effectively capture the actions in user-generated video content, such as YouTube videos, we must consider the following:

• The size of the video archive is overwhelming.
• User-created video content is widely diverse in content capture (camera settings), content presentation (event flow), and content editing.
• Actions that need to be detected vary in the scale of details that need to be captured.

This boils down to the following demands on the selection of the state-of-the-art spatio-temporal descriptor: (i) the descriptor extraction needs to be efficient, (ii) the extracted features need to be time and scale invariant, and (iii) the extracted features need to capture the rich semantics of action events in video archives.

For the TRECVID MED pilot task, we use the dense, scale-invariant, spatio-temporal Hes-STIP detector of Willems et al. [5]. This detector responds to spatio-temporal blobs within a video, based on an approximation of the determinant of the Hessian. These features are scale-invariant (in both the temporal and the spatial domain), and relatively dense compared with other spatio-temporal features.
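As a concrete illustration of the two submitted fusion schemes (the max-type fusion of run c raw 1 and the weighted fusion of run p base 1), the following is a minimal sketch. The detector scores, the weights, and the function names are illustrative placeholders, not output of the actual system.

```python
# Minimal sketch of the two fusion schemes used in the submitted runs:
# max-type fusion (run "c raw 1") and weighted fusion (run "p base 1")
# of per-clip scores produced by independently trained binary detectors.
import numpy as np

def max_fusion(scores):
    """scores: (num_detectors, num_clips) array -> per-clip maximum over detectors."""
    return scores.max(axis=0)

def weighted_fusion(scores, weights):
    """Convex combination of detector scores; weights are normalized to sum to one."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ scores

if __name__ == "__main__":
    # Three hypothetical detectors scoring five clips (placeholder values).
    scores = np.array([[0.10, 0.80, 0.30, 0.55, 0.20],
                       [0.40, 0.60, 0.25, 0.70, 0.15],
                       [0.05, 0.90, 0.50, 0.20, 0.10]])
    print(max_fusion(scores))                        # max-type fusion
    print(weighted_fusion(scores, [0.5, 0.3, 0.2]))  # weighted fusion
```

Max-type fusion keeps the strongest single detector response per clip, while weighted fusion trades the detectors off against one another; the weights shown here are arbitrary placeholders.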
1.1. Spatio-temporal interest point detection

The spatio-temporal scale space L is defined by convolving a spatio-temporal signal f with a Gaussian kernel g(·; σ², τ²), where σ denotes the spatial and τ the temporal scale:

L(·; σ², τ²) = g(·; σ², τ²) ∗ f(·)

Willems et al. [5] used the Hessian matrix for the point detection task. The Hessian matrix H is defined as the square matrix of all second-order partial derivatives of L:

H = [ L_xx  L_xy  L_xt ]
    [ L_yx  L_yy  L_yt ]
    [ L_tx  L_ty  L_tt ]

The Gaussian second-order derivatives in the spatio-temporal space (D_xx, D_yy, D_tt, D_xy, D_tx, and D_ty) can be approximated using box filters [6]. All six derivatives can be computed from rotated versions of only two different types of box filters. The box filters can be evaluated efficiently using an integral representation of the video [7]. The determinant of the matrix H defines the strength of an interest point at a certain scale.

1.2. SURF3D Descriptor

The descriptor used by Willems et al. is an extension of the 2D SURF image descriptor [6]. To describe an interest point, a rectangular volume of dimension sσ × sσ × sτ is defined, where σ represents the spatial scale, τ the temporal scale, and s is a magnification factor. The descriptor volume is divided into M × M × N subregions. Within each of these sub-volumes, three axis-aligned box filters d_x, d_y, and d_t are computed at uniform sample points. Every subregion is represented by the vector v = (Σ d_x, Σ d_y, Σ d_t). The resulting descriptor is invariant to spatial rotation if the dominant orientation is taken into account, and it is invariant to spatial and temporal scale if the box filters used have size σ × σ × τ. We use this dense, scale-invariant, spatio-temporal Hes-STIP detector and SURF3D descriptor in our activity detection pipeline.

2. ACTIVITY RECOGNITION

An event for MED 2010 is "an activity-centered happening that involves people engaged in process-driven actions with other people and/or objects at a specific place and time". In this preliminary report, we present the activity recognition component of our system.
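To keep the description above self-contained, the following is a minimal sketch of the integral-video box filtering on which both the detector responses of Section 1.1 and the SURF3D box-filter sums of Section 1.2 rely. It assumes the video volume is stored as a NumPy array; the function names, the toy dimensions, and the example box are illustrative only and not part of our actual system.

```python
# Minimal sketch: constant-time box sums over a video volume via an
# integral-video representation, the building block used to approximate
# the Gaussian second-derivative responses (Sec. 1.1) and the axis-aligned
# box-filter sums d_x, d_y, d_t of the SURF3D descriptor (Sec. 1.2).
import numpy as np

def integral_video(f):
    """Cumulative sum over x, y, and t, zero-padded so that any box sum
    can be read off with eight lookups."""
    ii = f.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)
    return np.pad(ii, ((1, 0), (1, 0), (1, 0)), mode="constant")

def box_sum(ii, x0, x1, y0, y1, t0, t1):
    """Sum of f over the half-open box [x0,x1) x [y0,y1) x [t0,t1),
    computed in constant time by inclusion-exclusion on the integral video."""
    return (ii[x1, y1, t1] - ii[x0, y1, t1] - ii[x1, y0, t1] - ii[x1, y1, t0]
            + ii[x0, y0, t1] + ii[x0, y1, t0] + ii[x1, y0, t0] - ii[x0, y0, t0])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = rng.random((32, 32, 16))   # toy video volume with axes (x, y, t)
    ii = integral_video(f)
    # The constant-time lookup agrees with the brute-force sum.
    assert np.isclose(box_sum(ii, 4, 12, 6, 14, 2, 10),
                      f[4:12, 6:14, 2:10].sum())
```

Because each box sum costs the same regardless of its size, such lookups make dense, multi-scale evaluation of the box-filter approximations tractable on large video archives.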
[1] Luc Van Gool et al., "An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector," ECCV, 2008.
[2] B. S. Manjunath et al., "Hierarchical scene understanding exploiting automatically derived contextual data," Defense + Commercial Sensing, 2010.
[3] Martial Hebert et al., "Efficient visual event detection using volumetric features," Tenth IEEE International Conference on Computer Vision (ICCV'05), Volume 1, 2005.
[4] Christopher Hunt et al., "Notes on the OpenSURF Library," 2009.
[5] Krystian Mikolajczyk et al., "Action recognition with motion-appearance vocabulary forest," 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[6] Mubarak Shah et al., "Human Action Recognition in Videos Using Kinematic Features and Multiple Instance Learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[7] John R. Smith et al., "Large-scale concept ontology for multimedia," IEEE MultiMedia, 2006.
[8] Paul Over et al., "High-level feature detection from video in TRECVid: a 5-year retrospective of achievements," 2009.
[9] Chih-Jen Lin et al., "LIBSVM: A library for support vector machines," TIST, 2011.
[10] John R. Smith et al., "Cluster-based data modeling for semantic video search," CIVR '07, 2007.