Co-clustering for Auditory Scene Categorization

Auditory scenes are temporal audio segments with coherent semantic content. Automatically classifying and grouping auditory scenes with similar semantics into categories benefits many multimedia applications, such as semantic event detection and indexing. For such semantic categorization, auditory scenes are first characterized with either low-level acoustic features or mid-level representations such as audio effects, and then supervised classifiers or unsupervised clustering algorithms are employed to group scene segments into semantic categories. In this paper, we focus on the problem of automatically categorizing auditory scenes in an unsupervised manner. To achieve more reasonable clustering results, we introduce a co-clustering scheme that exploits potential grouping trends among the dimensions of the feature space (either low-level or mid-level) and thereby provides a more accurate similarity measure for comparing auditory scenes. Moreover, we extend the co-clustering scheme with a strategy based on the Bayesian information criterion (BIC) to automatically estimate the numbers of clusters. An evaluation performed on 272 auditory scenes extracted from 12 h of audio data shows very encouraging categorization results. Co-clustering outperformed several traditional one-way clustering algorithms, both on the low-level acoustic features and on the mid-level audio effect representations. Finally, we present our vision regarding the applicability of this approach to general multimedia data, and also show some preliminary results on content-based image clustering.
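
To make the idea of grouping scenes and feature dimensions at the same time more concrete, the following is a minimal sketch of a simple alternating block-mean co-clustering. It is not the algorithm evaluated in the paper, and it leaves out the BIC-based estimation of the cluster numbers, which are passed in by hand here. The function names, the farthest-point seeding, and the toy scene-by-effect matrix in the usage note are all assumptions made for illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def seed_labels(vectors, k):
    """Farthest-point seeding: pick k seed vectors, then label by nearest seed."""
    seeds = [vectors[0]]
    while len(seeds) < k:
        seeds.append(max(vectors, key=lambda v: min(sq_dist(v, s) for s in seeds)))
    return [min(range(k), key=lambda a: sq_dist(v, seeds[a])) for v in vectors]


def block_means(X, rows, cols, k, l):
    """Mean of every (row cluster, column cluster) block; empty blocks get 0."""
    total = [[0.0] * l for _ in range(k)]
    count = [[0] * l for _ in range(k)]
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            total[rows[i]][cols[j]] += x
            count[rows[i]][cols[j]] += 1
    return [[total[a][b] / count[a][b] if count[a][b] else 0.0
             for b in range(l)] for a in range(k)]


def cocluster(X, k, l, max_iter=20):
    """Alternately reassign rows and columns to the best-fitting block means.

    X is a list of rows (scenes) over columns (feature dimensions);
    k and l are the numbers of row and column clusters (assumed known here).
    """
    n_rows, n_cols = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(n_cols)]
    rows = seed_labels(X, k)
    cols = seed_labels(columns, l)
    for _ in range(max_iter):
        # Row step: each row joins the row cluster whose block means fit it best.
        M = block_means(X, rows, cols, k, l)
        new_rows = [min(range(k),
                        key=lambda a: sum((X[i][j] - M[a][cols[j]]) ** 2
                                          for j in range(n_cols)))
                    for i in range(n_rows)]
        # Column step: same idea, using the freshly updated row clusters.
        M = block_means(X, new_rows, cols, k, l)
        new_cols = [min(range(l),
                        key=lambda b: sum((X[i][j] - M[new_rows[i]][b]) ** 2
                                          for i in range(n_rows)))
                    for j in range(n_cols)]
        if new_rows == rows and new_cols == cols:
            break  # no assignment changed, so the procedure has converged
        rows, cols = new_rows, new_cols
    return rows, cols
```

As a usage example under the same assumptions, calling `cocluster(X, 2, 2)` on a hypothetical 5-by-4 matrix of audio-effect counts such as `[[9, 0, 8, 1], [0, 7, 1, 9], [8, 1, 9, 0], [1, 8, 0, 8], [9, 0, 9, 0]]` places scenes 0, 2, and 4 in one row cluster and scenes 1 and 3 in the other, while effects 0 and 2 share one column cluster and effects 1 and 3 the other. Because rows are compared through block means over column clusters rather than over raw dimensions, correlated feature dimensions are pooled before scene similarity is measured, which is the intuition behind using co-clustering instead of one-way clustering.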
