A Mid-Level Scene Change Representation Via Audiovisual Alignment

A scene is a series of semantically correlated video shots, and effective scene detection depends, to a greater or lesser degree, on domain knowledge. Most existing approaches try to detect scene changes directly by applying clustering or supervised learning methods to low-level audiovisual features. However, robustly detecting diverse scene changes that arise from complex semantics remains a challenging problem. In this paper we focus on the association between visual signal changes (e.g., cuts, fade-ins, fade-outs) and audio signal changes (e.g., speaker changes, background music changes) and propose a mid-level scene change representation. The representation is meant to locate candidate scene change points by characterizing the temporally uncorrelated properties of the audio and visual tracks when a scene change occurs. By incorporating domain knowledge, enhanced features can be extracted to complement this representation and help bridge the semantic gap in scene change detection. We use a camera motion estimation algorithm to detect visual signal changes, and the positions of these changes are selected as time-stamp points. An alignment step then searches for candidate audio signal change positions by computing a multi-scale Kullback-Leibler (K-L) distance. Both a metric-based K-L distance approach and a model-based HMM approach are applied to determine the true audio signal changes. The associated visual and audio signal changes together form the mid-level scene change representation. This representation has been successfully applied to detect the boundaries of individual commercials in TV broadcast streams with an accuracy of around 95%. In particular, the systematic alignment approach can be utilized in video summarization.
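As an illustration of the alignment step, the following is a minimal sketch of a multi-scale K-L distance search around a visual time-stamp point. It is not the authors' implementation. It assumes that the audio features on each side of a candidate point are modelled as diagonal-covariance Gaussians, that the per-scale distances are simply averaged, and that the candidate with the highest score is kept. The function names, the window lengths in `scales`, and `search_radius` are hypothetical choices made for this example, and the model-based HMM verification stage is not shown.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians (an assumed model)."""
    return 0.5 * np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def symmetric_kl(left, right, eps=1e-6):
    """Symmetric K-L distance between two windows of feature frames (rows = frames)."""
    mu_l, mu_r = left.mean(axis=0), right.mean(axis=0)
    # eps keeps the variances strictly positive for near-constant features.
    var_l, var_r = left.var(axis=0) + eps, right.var(axis=0) + eps
    return gaussian_kl(mu_l, var_l, mu_r, var_r) + gaussian_kl(mu_r, var_r, mu_l, var_l)

def multiscale_kl(features, t, scales=(25, 50, 100)):
    """Average the symmetric K-L distance at frame t over several window lengths.

    The window lengths are hypothetical; scales that do not fit inside the
    signal are skipped.
    """
    scores = []
    for w in scales:
        lo, hi = t - w, t + w
        if lo < 0 or hi > len(features):
            continue
        scores.append(symmetric_kl(features[lo:t], features[t:hi]))
    return float(np.mean(scores)) if scores else 0.0

def align_audio_change(features, visual_t, search_radius=20, scales=(25, 50, 100)):
    """Search near a visual change position for the strongest audio change.

    Returns (frame index, score) of the candidate with the largest
    multi-scale K-L distance inside the search window.
    """
    lo = max(0, visual_t - search_radius)
    hi = min(len(features), visual_t + search_radius + 1)
    scores = [multiscale_kl(features, t, scales) for t in range(lo, hi)]
    best = int(np.argmax(scores))
    return lo + best, scores[best]
```

In use, `features` would be a frame-by-dimension array of audio features (for example cepstral coefficients, which is also an assumption here), and `visual_t` the frame index of a detected visual change. A threshold on the returned score, or a subsequent model-based check, would then decide whether the candidate is kept.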
