Generative and Discriminative Modeling toward Semantic Context Detection in Audio Tracks

Semantic-level content analysis is crucial for efficient content retrieval and management. We propose a hierarchical approach that models the statistical characteristics of several audio events over a time series to accomplish semantic context detection. Two stages, audio event modeling/testing and semantic context modeling/testing, are devised to bridge the semantic gap between physical audio features and semantic concepts. For action movies, the focus of this work, hidden Markov models (HMMs) are used to model four representative audio events: gunshot, explosion, car-braking, and engine sounds. At the semantic context level, a generative approach (ergodic hidden Markov model) and a discriminative approach (support vector machine, SVM) are investigated to fuse the characteristics and correlations among the audio events, which provide cues for detecting gunplay and car-chasing scenes. The experimental results demonstrate the effectiveness of the proposed approaches and outline a framework for semantic indexing and retrieval. Moreover, the differences between the two fusion schemes are discussed as a reference for future research.
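The generative fusion stage can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes each audio segment has already been reduced to the index of its dominant audio event (0 = gunshot, 1 = explosion, 2 = car-braking, 3 = engine), scores the resulting sequence under one small ergodic (fully connected) HMM per semantic context, and picks the context with the higher likelihood. The two-state topology, all probability values, and the names `forward_log_likelihood`, `detect_context`, and `MODELS` are hypothetical choices made for illustration only.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | HMM) for discrete observations."""
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n = len(start)
    # Initialisation: alpha_0(i) = pi_i * b_i(o_0)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        # Rescale at every step to avoid underflow; the logs of the
        # scaling factors add up to the sequence log-likelihood.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

# Hypothetical two-state ergodic HMMs (every state can reach every state).
# Emission columns: gunshot, explosion, car-braking, engine.
MODELS = {
    "gunplay": dict(
        start=[0.5, 0.5],
        trans=[[0.7, 0.3], [0.4, 0.6]],
        emit=[[0.70, 0.20, 0.05, 0.05], [0.20, 0.60, 0.10, 0.10]],
    ),
    "car_chasing": dict(
        start=[0.5, 0.5],
        trans=[[0.6, 0.4], [0.3, 0.7]],
        emit=[[0.05, 0.05, 0.70, 0.20], [0.10, 0.10, 0.20, 0.60]],
    ),
}

def detect_context(obs, models=MODELS):
    """Return the name of the context model with the highest log-likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(obs, **models[name]))

if __name__ == "__main__":
    print(detect_context([0, 0, 1, 0, 1]))  # mostly gunshots and explosions
    print(detect_context([3, 2, 3, 3, 2]))  # mostly engine and braking sounds
```

A discriminative variant would instead feed a fixed-length vector of per-event scores for each segment to an SVM classifier; the sketch above covers only the generative side.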
