TRECVID 2007 Search Tasks by NUS-ICT

This paper describes the details of our automated and interactive search systems for TRECVID 2007. The shift from news video to documentary video this year has prompted a series of changes to the processing techniques we developed over the past few years. For the automated search task, we employ our previous query-dependent retrieval approach, which automatically discovers the query class and query-high-level-features (query-HLF) to fuse the available multimodal features. Unlike our previous work, our system this year places more emphasis on visual features such as color, texture and motion in the video source. The reasons are: (a) given the low quality of the ASR text and the more visual and motion-oriented queries, we expect the visual features to be as discriminating as the text features; and (b) the appropriate use of motion features is highly effective for such queries, as they are able to model dynamic changes across the frames of a shot. For the interactive task, we first present the automated search results to the user for feedback. The user can make use of our intuitive retrieval interface, with a variety of relevance feedback techniques, to refine the search results. In addition, we introduce motion-icons, which allow users to see a dynamic series of keyframes instead of a single keyframe during assessment. Results show that this approach can help in providing better discrimination.

1. INTRODUCTION

The overall framework of our video search and retrieval system, covering both the automated and the interactive tasks, is shown in Figure 1. There are two main stages: the auto search stage and the interactive search stage. Retrieval starts with the user query, which can be a simple free-text query, or text coupled with image and video examples (a multimedia query). The auto search stage first processes the multimedia query and performs the retrieval. The emphasis is on understanding the query in order to infer the roles of HLF, motion and visual features in query processing. For the interactive search, the user examines the automated search results and indicates whether each result is relevant. The emphasis here is on designing a high-performance feedback system, in which users can employ several auto-feedback and active-learning functions to improve the retrieval performance.

This year's corpus is Dutch documentary video. The videos are preprocessed and segmented into shots, with the speech track automatically recognized using a commercial automatic speech recognition (ASR) engine and then translated into English text. As a result of ASR and translation errors, the quality of the ASR text is quite low. This, coupled with the large number of visual and motion-oriented queries, suggests that the ASR text may not play a critical role in the retrieval process. In fact, visual and motion information will be as important as text as we move from news video to Dutch documentary video retrieval. (Illustrative sketches of the query-dependent fusion, relevance feedback, and motion-icon components appear below.)
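To make the query-dependent fusion concrete, the following is a minimal sketch of query-class-dependent late fusion over per-modality retrieval scores. The query classes, modality names and weight values here are illustrative assumptions, not the system's actual values; in the real system the query class and query-HLFs are discovered automatically from the query.

```python
# Minimal sketch of query-class-dependent late fusion (illustrative only).
# The classes and weights below are assumptions for demonstration.

from typing import Dict, List

# Hypothetical per-class fusion weights over the available modalities.
FUSION_WEIGHTS: Dict[str, Dict[str, float]] = {
    "person": {"text": 0.5, "hlf": 0.3, "visual": 0.15, "motion": 0.05},
    "scene":  {"text": 0.2, "hlf": 0.3, "visual": 0.4,  "motion": 0.1},
    "event":  {"text": 0.2, "hlf": 0.2, "visual": 0.2,  "motion": 0.4},
}

def fuse_scores(query_class: str,
                modality_scores: Dict[str, List[float]]) -> List[float]:
    """Linearly combine per-shot scores from each modality using the
    weights associated with the predicted query class."""
    weights = FUSION_WEIGHTS[query_class]
    n_shots = len(next(iter(modality_scores.values())))
    fused = [0.0] * n_shots
    for modality, scores in modality_scores.items():
        w = weights.get(modality, 0.0)
        for i, s in enumerate(scores):
            fused[i] += w * s
    return fused

# Example: an "event" query weights motion evidence most heavily.
scores = {
    "text":   [0.1, 0.7, 0.3],
    "hlf":    [0.2, 0.5, 0.6],
    "visual": [0.3, 0.4, 0.8],
    "motion": [0.9, 0.1, 0.5],
}
print(fuse_scores("event", scores))
```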
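Similarly, one round of interactive relevance feedback can be sketched as a Rocchio-style update on shot feature vectors. This is an illustrative stand-in for the system's auto-feedback and active-learning functions, not its exact algorithm; the parameter values are conventional defaults, not tuned ones.

```python
# Sketch of one relevance-feedback round: move the query vector toward
# user-marked relevant shots, away from non-relevant ones, then rerank.

from typing import List
import numpy as np

def rocchio_rerank(query_vec: np.ndarray,
                   shot_vecs: np.ndarray,
                   relevant: List[int],
                   nonrelevant: List[int],
                   alpha: float = 1.0,
                   beta: float = 0.75,
                   gamma: float = 0.15) -> np.ndarray:
    """Update the query with user feedback and rerank all shots by
    cosine similarity to the updated query."""
    q = alpha * query_vec
    if relevant:
        q = q + beta * shot_vecs[relevant].mean(axis=0)
    if nonrelevant:
        q = q - gamma * shot_vecs[nonrelevant].mean(axis=0)
    sims = shot_vecs @ q / (
        np.linalg.norm(shot_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)  # shot indices, best first
```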
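Finally, a motion-icon can be sketched as an evenly spaced series of frames sampled from a shot for the assessor to flip through, instead of a single static keyframe. The frame-access details (OpenCV, and the `n_frames` parameter) are assumptions for illustration.

```python
# Sketch of building a "motion icon": sample a short, evenly spaced
# series of frames from a shot to animate during assessment.

import cv2  # assumption: OpenCV is available for frame access

def motion_icon(video_path: str, start_frame: int, end_frame: int,
                n_frames: int = 5):
    """Return n_frames evenly spaced frames from the shot
    [start_frame, end_frame]."""
    cap = cv2.VideoCapture(video_path)
    span = max(end_frame - start_frame, 1)
    frames = []
    for k in range(n_frames):
        if n_frames > 1:
            idx = start_frame + round(k * span / (n_frames - 1))
        else:
            idx = start_frame
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```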