A decade of work on semantic concept detection from video: TRECVid’s contribution

We are interested in content-based tasks for visual media, especially video. The automatic assignment of semantic tags representing visual or multimodal concepts (“high-level features”) to video segments is a fundamental technology for filtering, categorisation, browsing, search, and other video exploitation activities. Approaches based on metadata, automatically recognised speech, image-to-image similarity, and object detection and matching all have their contributions, but the use of semantic concepts is the most scalable of these and has an interesting intersection with language.

TRECVid is an annual benchmarking activity which has been running for 10 years [1]. It has addressed many tasks, such as shot boundary detection, summarisation, and various forms of search. The 2012 edition of TRECVid addressed six tasks, including Semantic INdexing (SIN) of video. The SIN task is defined as follows: given a test collection of video, a master shot reference, and a set of concept definitions, return for each concept a list of at most 2,000 shot IDs from the test collection, ranked according to their likelihood of containing the concept. The data set comprises 291 hours of short videos, each between 10 seconds and 3.5 minutes in duration, with at least 4 positive examples of each concept. Of the 346 test concepts, a subset of 50, not known to participants at submission time, is manually judged. The task has run since the start of TRECVid, and the 2012 edition is similar to those of previous years, except in scale.

For the majority of participants, the approach has been to build an independent detector for each concept by training a machine learning toolkit on a set of positive and negative example shots, using low-level features extracted from the images, as described in [2]; a sketch of this pipeline appears below. To support this, a collaborative annotation of the concepts is carried out each year to build a concept bank on which classifiers can be trained. The TRECVid organisers also provide a set of relations between the concepts, of two types: A implies B, and A excludes B. Relations that can be derived by transitivity are not listed explicitly. Participants are free to use the relations or not, and submissions are not required to comply with them. Some of the more advanced methods from participants use the annotations of non-evaluated concepts and the ontology relations to improve the detection of the evaluated concepts; a second sketch below illustrates one way such relations might be exploited.
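To make the standard pipeline concrete, the following Python sketch trains one independent detector per concept and returns a ranked list of at most 2,000 shot IDs, as the SIN task requires. It assumes shots already represented by pre-extracted low-level feature vectors with binary per-concept annotations; the classifier (an RBF-kernel SVM via scikit-learn) and all names are illustrative choices, not those of any particular TRECVid system.

    import numpy as np
    from sklearn.svm import SVC

    def train_concept_detector(features, labels):
        # One independent binary classifier per concept. `features` is an
        # (n_shots, n_dims) array of low-level descriptors (e.g. colour or
        # texture histograms); `labels` is 1 where the concept is annotated
        # as present and 0 otherwise. The toolkit choice is illustrative.
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(features, labels)
        return clf

    def rank_shots(clf, test_features, shot_ids, max_results=2000):
        # Score every test shot for this concept and keep at most 2,000
        # shot IDs, ranked by likelihood of containing the concept.
        scores = clf.predict_proba(test_features)[:, 1]
        order = np.argsort(-scores)[:max_results]
        return [(shot_ids[i], float(scores[i])) for i in order]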
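The concept relations can be used in a similar spirit. Because relations derivable by transitivity are not listed explicitly, a participant wanting the full implication set must first compute its transitive closure; the score adjustment shown afterwards is one simple, hypothetical heuristic, not the method of any submitted system.

    def transitive_closure(implies):
        # `implies` maps each concept to the set of concepts it directly
        # implies, as distributed by the organisers; expand it until every
        # indirectly implied concept is also listed.
        closure = {c: set(targets) for c, targets in implies.items()}
        changed = True
        while changed:
            changed = False
            for c, targets in closure.items():
                extra = set()
                for t in targets:
                    extra |= closure.get(t, set())
                if not extra <= targets:
                    targets |= extra
                    changed = True
        return closure

    def adjust_scores(scores, implies_closure, excludes):
        # `scores` maps concept -> detector score in [0, 1] for one shot.
        # This adjustment is a hypothetical illustration only.
        adjusted = dict(scores)
        for a, bs in implies_closure.items():
            for b in bs:  # if A is present then B must be: propagate up
                adjusted[b] = max(adjusted.get(b, 0.0), scores.get(a, 0.0))
        for a, b in excludes:  # A and B cannot co-occur: damp the weaker
            sa, sb = adjusted.get(a, 0.0), adjusted.get(b, 0.0)
            if sa >= sb:
                adjusted[b] = min(sb, 1.0 - sa)
            else:
                adjusted[a] = min(sa, 1.0 - sb)
        return adjusted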