Toward semantic indexing and retrieval using hierarchical audio models

Semantic-level content analysis is a crucial issue in achieving efficient content retrieval and management. We propose a hierarchical approach that models the statistical characteristics of audio events over a time series to accomplish semantic context detection. Two stages, audio event and semantic context modeling, are devised to bridge the semantic gap between physical audio features and semantic concepts. In this work, hidden Markov models (HMMs) are used to model four representative audio events, i.e., gunshot, explosion, engine, and car-braking, in action movies. At the semantic-context level, Gaussian mixture models (GMMs) and ergodic HMMs are investigated to fuse the characteristics and correlations between various audio events. They provide cues for detecting gunplay and car-chasing scenes, two semantic contexts we focus on in this work. The promising experimental results demonstrate the effectiveness of the proposed approach and exhibit that the proposed framework provides a foundation in semantic indexing and retrieval. Moreover, the two fusion schemes are compared, and the relations between audio event and semantic context are studied.

[1]  Salvatore J. Stolfo,et al.  JAM: Java Agents for Meta-Learning over Distributed Databases , 1997, KDD.

[2]  Svetha Venkatesh,et al.  Media computing : computational media aesthetics , 2002 .

[3]  Zhu Liu,et al.  Classification TV programs based on audio information using hidden Markov model , 1998, 1998 IEEE Second Workshop on Multimedia Signal Processing (Cat. No.98EX175).

[4]  Stan Z. Li,et al.  Content-based audio classification and retrieval using the nearest feature line method , 2000, IEEE Trans. Speech Audio Process..

[5]  Lie Lu,et al.  Automated extraction of music snippets , 2003, ACM Multimedia.

[6]  Milind R. Naphade,et al.  A probabilistic framework for semantic video indexing, filtering, and retrieval , 2001, IEEE Trans. Multim..

[7]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[8]  George Tzanetakis,et al.  Musical genre classification of audio signals , 2002, IEEE Trans. Speech Audio Process..

[9]  Wei-Hao Lin,et al.  Meta-classification: Combining Multimodal Classifiers , 2002, Revised Papers from MDM/KDD and PAKDD/KDMCD.

[10]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[11]  Svetha Venkatesh,et al.  Horror film genre typing and scene labeling via audio analysis , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[12]  John R. Smith,et al.  Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues , 2003, EURASIP J. Adv. Signal Process..

[13]  Thomas S. Huang,et al.  Audio-visual event detection using duration dependent input output Markov models , 2001, Proceedings IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL 2001).

[14]  Jean-Francois Cardoso,et al.  Blind signal separation: statistical principles , 1998, Proc. IEEE.

[15]  Alan Hanjalic,et al.  Shot-boundary detection: unraveled and resolved? , 2002, IEEE Trans. Circuits Syst. Video Technol..

[16]  H. Zettl Sight, Sound, Motion: Applied Media Aesthetics , 1973 .

[17]  Ying Li,et al.  An Overview of Video Abstraction Techniques , 2001 .

[18]  Shrikanth S. Narayanan,et al.  Split-lexicon based hierarchical recognition of speech using syllable and word level acoustic units , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[19]  Lie Lu,et al.  Content analysis for audio classification and segmentation , 2002, IEEE Trans. Speech Audio Process..

[20]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[21]  C. Koch,et al.  Computational modelling of visual attention , 2001, Nature Reviews Neuroscience.

[22]  Sing-Tze Bow,et al.  Pattern recognition and image preprocessing , 1992 .

[23]  Milind R. Naphade,et al.  A probabilistic framework for semantic indexing and retrieval in video , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[24]  X. Jin Factor graphs and the Sum-Product Algorithm , 2002 .

[25]  Nevenka Dimitrova Context and Memory in Multimedia Content Analysis , 2004, IEEE Multim..

[26]  Chabane Djeraba,et al.  Mining Multimedia and Complex Data , 2002, Lecture Notes in Computer Science.

[27]  Douglas A. Reynolds,et al.  Robust text-independent speaker identification using Gaussian mixture speaker models , 1995, IEEE Trans. Speech Audio Process..

[28]  Brendan J. Frey,et al.  Probabilistic multimedia objects (multijects): a novel approach to video indexing and retrieval in multimedia systems , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[29]  David G. Stork,et al.  Pattern Classification , 1973 .

[30]  Heinz Hügli,et al.  Empirical Validation of the Saliency-based Model of Visual Attention , 2003 .

[31]  Zhu Liu,et al.  Multimedia content analysis-using both audio and visual clues , 2000, IEEE Signal Process. Mag..

[32]  Boon-Lock Yeo,et al.  Rapid scene analysis on compressed video , 1995, IEEE Trans. Circuits Syst. Video Technol..

[33]  Stan Z. Li,et al.  Content-based Classification and Retrieval of Audio Using the Nearest Feature Line Method , 2000 .

[34]  Wolfgang Effelsberg,et al.  Abstracting Digital Movies Automatically , 1996, J. Vis. Commun. Image Represent..

[35]  Avideh Zakhor,et al.  Applications of Video-Content Analysis and Retrieval , 2002, IEEE Multim..

[36]  Chng Eng Siong,et al.  Sports highlight detection from keyword sequences using HMM , 2004, 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763).

[37]  C.-C. Jay Kuo,et al.  Hierarchical system for content-based audio classification and retrieval , 1998, Other Conferences.

[38]  Lie Lu,et al.  Highlight sound effects detection in audio stream , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[39]  Milind R. Naphade,et al.  Extracting semantics from audio-visual content: the final frontier in multimedia retrieval , 2002, IEEE Trans. Neural Networks.