In this paper, we propose an approach to surveillance event detection, assuming that the definition of each event is known. To facilitate the discussion, we first define two concepts: the event of interest is the event that the user asks the system to detect, and the background activities are all other events in the video corpus. Surveillance event detection remains an unsolved problem owing to the factors listed below:
1) Occlusions and clustering: Surveillance scenes of significant interest, at locations such as airports, railway stations, and shopping centers, are often crowded, so occlusions and clustering of people are frequently encountered. This significantly degrades the feature extraction step; for instance, trajectories generated by object tracking algorithms are usually not robust in such situations.
2) The requirement for real-time detection: The system must process the video fast enough in both the feature extraction step and the detection step to support real-time operation.
3) Massive size of the training data set: Suppose an event lasts for 1 minute in a video with a frame rate of 25 fps; the number of frames for this event is 60 × 25 = 1500. If we want a training data set with many positive instances of the event, the video is likely to be very large (e.g., hundreds of thousands of frames or more). Handling such a large data set is a problem frequently encountered in this application.
4) Difficulty in separating the event of interest from background activities: The events of interest often co-exist with a set of background activities. Temporal ground truth is typically very ambiguous, as it does not distinguish the event of interest from the wide range of co-existing background activities, yet it is not practical to annotate the locations of the events in large amounts of video data. This problem becomes more serious in the detection of multi-agent interactions, since the location of these events often cannot be confined to a bounding box.
5) Challenges in determining the temporal boundaries of the events: An event can occur at any time and with an arbitrary duration. The temporal segmentation of events is difficult and ambiguous, and is also affected by other factors such as occlusions.
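To make the scale argument in item 3 concrete, the following minimal sketch reproduces the frame-count arithmetic. The function names, the figure of 200 positive instances, and the `positive_fraction` parameter are illustrative assumptions, not values taken from our data set.

```python
def frames_for_event(duration_s: float, fps: float) -> int:
    """Number of frames spanned by one event of the given duration."""
    return round(duration_s * fps)


def training_set_frames(num_instances: int, duration_s: float, fps: float,
                        positive_fraction: float = 1.0) -> int:
    """Rough total frame count of a training video.

    positive_fraction is a hypothetical parameter: the share of frames
    that belong to positive instances. With 1.0 the video contains
    nothing but the event of interest, which gives a lower bound.
    """
    positive_frames = num_instances * frames_for_event(duration_s, fps)
    return round(positive_frames / positive_fraction)


# The example from item 3: a 1-minute event at 25 fps.
print(frames_for_event(60, 25))            # 1500

# Assumed figure: 200 positive instances of that event.
print(training_set_frames(200, 60, 25))    # 300000
```

If positive instances cover only a small share of the footage, as is common when background activities dominate, the total grows in inverse proportion to that share.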