This notebook paper describes the six runs submitted for the first participation of IRIT in the TRECVid 2009 High-Level Feature Extraction task. They were submitted in an attempt to start answering two research questions:

1. Can acoustic information be of any help in this (historically) video-only task?
2. Are Support Vector Machines robust enough to deal with noisy and unbalanced datasets?

The six submitted runs can be described and compared as follows:

• Run 6 (A IRIT V Mono 6): SVM-based late fusion of visual descriptors.
• Run 4 (A IRIT AV Mono 4): SVM-based late fusion of visual and audio descriptors.
• Run 5 (A IRIT V Poly 5): same as run 6, except that scores from other concepts are added during the late-fusion process.
• Run 3 (A IRIT AV Poly 3): same as run 4, except that scores from other concepts are added during the late-fusion process.
• Run 1 (A IRIT AV BestAvg 1): for each concept, uses the better of runs 3 and 4.
• Run 2 (A IRIT AV BestMax 2): the difference between runs 1 and 2 is explained in Section 4.3.

Given the relatively poor performance of the six submitted runs (average precision ranges between 0.022 and 0.027), no definitive answer can be given to the first question: audio definitely helps for some concepts but is useless for others. As for the second question, additional work is needed on how to use SVMs efficiently in this task.
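To make the late-fusion scheme underlying runs 3 to 6 concrete, the following minimal Python sketch trains one SVM per descriptor for a single concept and a second SVM on the stacked per-descriptor scores. It uses scikit-learn and purely synthetic data; the descriptor names, kernels, and parameters are illustrative assumptions, not the settings of the submitted runs.

```python
# Minimal sketch of SVM-based late fusion for one concept.
# Descriptor names and data are hypothetical placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy per-shot descriptors: two visual modalities and one audio modality.
n_train, n_test = 200, 50
descriptors = {
    "visual_color": (rng.normal(size=(n_train, 32)), rng.normal(size=(n_test, 32))),
    "visual_bow": (rng.normal(size=(n_train, 64)), rng.normal(size=(n_test, 64))),
    "audio": (rng.normal(size=(n_train, 16)), rng.normal(size=(n_test, 16))),
}
y_train = rng.integers(0, 2, size=n_train)  # concept labels (unbalanced in practice)

# Step 1: one SVM per descriptor, each producing a score per shot.
train_scores, test_scores = [], []
for name, (X_tr, X_te) in descriptors.items():
    clf = SVC(kernel="rbf", probability=True, class_weight="balanced")
    clf.fit(X_tr, y_train)
    train_scores.append(clf.predict_proba(X_tr)[:, 1])
    test_scores.append(clf.predict_proba(X_te)[:, 1])

# Step 2: late fusion -- a second SVM trained on the stacked per-descriptor scores.
S_train = np.column_stack(train_scores)
S_test = np.column_stack(test_scores)
fusion = SVC(kernel="linear", probability=True, class_weight="balanced")
fusion.fit(S_train, y_train)
fused_scores = fusion.predict_proba(S_test)[:, 1]  # ranking scores for the concept
```

In the "Poly" variants, the score vector fed to the fusion SVM would additionally include scores obtained for other concepts; in the audio-visual runs, the audio descriptors are simply added as further inputs to the same fusion step.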