IRIT @ TRECVid HLF 2009 - Audio to the Rescue

This notebook paper describes the six runs submitted for IRIT's first participation in the TRECVid 2009 High-Level Feature Extraction task. They were submitted as a first step towards answering two research questions:

1. Can acoustic information be of any help in this (historically) video-only task?
2. Are Support Vector Machines robust enough to deal with noisy and unbalanced datasets?

The six submitted runs can be described and compared as follows:

• Run 6 (A IRIT V Mono 6): SVM-based late fusion of visual descriptors.
• Run 4 (A IRIT AV Mono 4): SVM-based late fusion of visual and audio descriptors.
• Run 5 (A IRIT V Poly 5): same as run 6, except that scores from other concepts are added during the late-fusion process.
• Run 3 (A IRIT AV Poly 3): same as run 4, except that scores from other concepts are added during the late-fusion process.
• Run 1 (A IRIT AV BestAvg 1): for each concept, uses the best of runs 3 and 4.
• Run 2 (A IRIT AV BestMax 2): the difference between runs 1 and 2 is explained in Section 4.3.

Given the relatively poor performance of the six submitted runs (average precision ranges between 0.022 and 0.027), no definitive answer can be given to the first question: audio clearly helps for some concepts and is useless for others. As for the second question, additional work is needed on how to use SVMs efficiently in this task.
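To make the "best of runs 3 and 4" idea behind run 1 more concrete, the following is a minimal sketch of a per-concept selection between two candidate runs. It is an illustration under stated assumptions, not the procedure actually used: the selection criterion (average precision on a held-out validation set), the run labels `poly_av` and `mono_av`, the concept names, and all scores and labels are hypothetical. The actual criteria that distinguish runs 1 and 2 are those of Section 4.3.

```python
from typing import Dict, List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated average precision of the ranking induced by `scores`."""
    # Rank shots from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant shot
    return precision_sum / hits if hits else 0.0


def pick_best_run(
    val_scores: Dict[str, Dict[str, List[float]]],
    val_labels: Dict[str, List[int]],
) -> Dict[str, str]:
    """For each concept, return the run with the highest validation AP.

    Assumption: selection is driven by average precision on held-out data.
    """
    best = {}
    for concept, labels in val_labels.items():
        best[concept] = max(
            val_scores,
            key=lambda run: average_precision(val_scores[run][concept], labels),
        )
    return best


# Hypothetical validation data: two concepts, four shots each.
val_labels = {"concept_a": [1, 0, 1, 0], "concept_b": [0, 1, 0, 1]}
val_scores = {
    "poly_av": {"concept_a": [0.9, 0.1, 0.8, 0.2], "concept_b": [0.9, 0.1, 0.8, 0.2]},
    "mono_av": {"concept_a": [0.1, 0.9, 0.2, 0.8], "concept_b": [0.1, 0.9, 0.2, 0.8]},
}

best = pick_best_run(val_scores, val_labels)
print(best)
```

With this toy data, each concept is assigned to a different run, which mirrors the observation that the benefit of a given configuration varies from one concept to another.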