TRECVID 2003 Experiments at MediaTeam Oulu and VTT

MediaTeam Oulu and VTT Technical Research Centre of Finland participated jointly in the semantic feature extraction, manual search and interactive search tasks of TRECVID 2003. In the semantic feature extraction task we submitted results for 15 of the 17 defined semantic categories. Our approach utilized spatio-temporal visual features, based on correlations of quantized gradient edges and color values, together with several physical features computed from the audio signal. The most recent version of our Video Browsing and Retrieval System (VIRE) contains an interactive cluster-temporal browser of video shots that exploits three semantic levels of similarity: visual, conceptual and lexical. The informativeness of the browser was enhanced by incorporating automatic speech recognition (ASR) transcripts into the visual views based on shot key frames. The results for the interactive search task were obtained in a user experiment with eight participants and two system configurations: (I) browsing by visual features only (visual and conceptual browsing allowed, no browsing with ASR text), or (II) browsing by visual features and ASR text (all semantic browsing levels available and ASR text visible). The interactive results using ASR-based features were better than those using visual features alone, which indicates the importance of successfully integrating visual and textual features for video browsing. In contrast to the previous version of VIRE, which performed early feature fusion by training unsupervised self-organizing maps, the newest version capitalizes on late fusion of per-feature queries; this approach was evaluated in the manual search task. This paper gives an overview of the developed system and summarizes the results.
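The late-fusion strategy mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each feature channel (e.g. color correlations, edge correlations, audio-derived features) independently produces a per-shot relevance score for a query, and that the result lists are merged only afterwards by min-max normalization and a weighted sum. The shot identifiers and weights below are hypothetical.

```python
def min_max_normalize(scores):
    """Scale a {shot_id: score} map to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against identical scores
    return {shot: (s - lo) / span for shot, s in scores.items()}

def late_fusion(feature_scores, weights):
    """Late fusion: each feature is queried independently and only
    the resulting score lists are combined (weighted sum), in
    contrast to early fusion, where features are merged before
    indexing (e.g. by training a joint self-organizing map)."""
    fused = {}
    for name, scores in feature_scores.items():
        w = weights.get(name, 1.0)
        for shot, s in min_max_normalize(scores).items():
            fused[shot] = fused.get(shot, 0.0) + w * s
    # Return shot ids ranked by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical per-feature scores for three shots
color = {"shot1": 0.9, "shot2": 0.3, "shot3": 0.1}
edges = {"shot1": 0.2, "shot2": 0.8, "shot3": 0.4}
ranking = late_fusion({"color": color, "edges": edges},
                      {"color": 0.6, "edges": 0.4})
# → ["shot1", "shot2", "shot3"]
```

Because fusion happens over result lists rather than raw features, channel weights can be tuned per query class without re-indexing the collection, which is one practical appeal of the late-fusion design.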
