Multi-label audio concept detection using correlated-aspect Gaussian Mixture Model

Audio concept detection is inherently a multi-label classification problem, yet it is typically solved by treating each concept independently, which discards useful information about the correlations between concepts. This paper proposes a new model, the Correlated-Aspect Gaussian Mixture Model (C-AGMM), that exploits this correlation clue to enhance multi-label audio concept detection. C-AGMM builds on the Aspect Gaussian Mixture Model (AGMM), which improves the GMM by embedding it in probabilistic Latent Semantic Analysis (pLSA), and likewise learns a probabilistic model of the whole audio clip with concepts as its component elements. Unlike AGMM, however, which assumes the concepts are mutually independent, C-AGMM models their distribution on a sub-manifold embedded in the ambient space. Under the assumption that two concepts close to each other in the intrinsic geometry of this distribution are likely to have similar conditional probability distributions, a graph regularizer is exploited to model the correlation between concepts. Following the Maximum Likelihood Estimation principle, the model parameters of C-AGMM, which encode the concept correlation clue, are derived and used directly as the detection criterion. Experiments on two datasets demonstrate the effectiveness of the proposed model.
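The graph-regularizer idea above can be illustrated with a minimal sketch. This is not the paper's actual C-AGMM implementation; it only shows the standard manifold-smoothness penalty that such regularizers are built on: given a concept similarity graph with weights `W` and per-concept conditional distributions stacked in a matrix `P` (names chosen here for illustration), the penalty sums the weighted squared differences between the distributions of connected concepts, which equals a trace expression in the graph Laplacian.

```python
import numpy as np

# Illustrative sketch of a graph regularizer over concept distributions:
# if two concepts are strongly correlated (large W[i, j]), their
# conditional probability distributions P[i] and P[j] should be similar.

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W of a concept similarity graph."""
    D = np.diag(W.sum(axis=1))
    return D - W

def manifold_penalty(P, W):
    """sum_{i,j} W[i,j] * ||P[i] - P[j]||^2  ==  2 * trace(P^T L P).

    P[i] is the conditional distribution attached to concept i
    (e.g. mixture/aspect weights in a pLSA-style model).
    """
    L = graph_laplacian(W)
    return 2.0 * np.trace(P.T @ L @ P)

# Toy example: 3 concepts, where concepts 0 and 1 are strongly correlated.
W = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])

P_similar = np.array([[0.8, 0.2],     # concepts 0 and 1 share the same
                      [0.8, 0.2],     # distribution, matching the graph
                      [0.1, 0.9]])
P_dissimilar = np.array([[0.8, 0.2],  # correlated concepts 0 and 1 now
                         [0.1, 0.9],  # disagree, so the penalty grows
                         [0.8, 0.2]])

# The penalty is smaller when correlated concepts have similar distributions.
assert manifold_penalty(P_similar, W) < manifold_penalty(P_dissimilar, W)
```

In a model like C-AGMM this penalty would be subtracted (with a trade-off weight) from the data log-likelihood before running EM, so that the learned parameters balance fit against smoothness over the concept graph.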
