Audio Keywords Discovery for Text-Like Audio Content Analysis and Retrieval

Inspired by classical text document analysis, which is built on the concept of (key) words, this paper presents an unsupervised approach to discovering (key) audio elements in general audio documents. The (key) audio elements can be considered the equivalents of text (key) words, and they enable content-based audio analysis and retrieval by analogy with proven text analysis theories and methods. Since general audio signals usually show complicated and strongly varying distribution and density in the feature space, we propose an iterative spectral clustering method with context-dependent scaling factors to decompose an audio data stream into audio elements. Using this clustering method, temporal signal segments with similar low-level features are grouped into natural clusters, which we adopt as audio elements. To detect the audio elements that are most representative of the semantic content, that is, the key audio elements, two cases are considered. First, if only one audio document is available for analysis, a number of heuristic importance indicators are defined and employed to detect the key audio elements. Second, if multiple audio documents are available, more sophisticated measures of audio element importance are proposed, including expected term frequency (ETF), expected inverse document frequency (EIDF), expected term duration (ETD) and expected inverse document duration (EIDD). Our experiments showed encouraging results regarding the quality of the obtained (key) audio elements and their potential applicability for content-based audio document analysis and retrieval.
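
To illustrate why scaling factors that adapt to the local context help when cluster density varies strongly, the following is a minimal sketch of a locally scaled affinity matrix of the kind used in self-tuning spectral clustering. It is not the paper's exact formulation: the function name `local_scale_affinity`, the neighbour index `k`, and the toy 2-D points are assumptions made for illustration, and the iterative clustering step itself is omitted.

```python
import math


def local_scale_affinity(points, k=2):
    """Return A with A[i][j] = exp(-d(i, j)^2 / (sigma_i * sigma_j)).

    sigma_i is the distance from point i to its k-th nearest neighbour,
    so each point gets its own scale (an illustrative assumption, not
    necessarily the scaling factors defined in the paper).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # sorted(row)[0] is the zero self-distance, so index k is the k-th neighbour.
    sigma = [sorted(row)[min(k, n - 1)] for row in dist]
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Guard against a zero scale caused by duplicate points.
                denom = max(sigma[i] * sigma[j], 1e-12)
                affinity[i][j] = math.exp(-dist[i][j] ** 2 / denom)
    return affinity


# Toy data: a tight cluster and a cluster ten times more spread out.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
A = local_scale_affinity(points, k=2)
```

With a single global scale, either the tight cluster or the loose one would get poorly calibrated affinities. With per-point scales, the within-cluster affinities of the two clusters in this toy example come out equal, while the cross-cluster affinities stay close to zero.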
