Bayesian semi-supervised audio event transcription based on Markov Indian buffet process

We present a novel generative model for audio event transcription that recognizes events in audio signals containing multiple kinds of overlapping sounds. The model first represents the overlapping audio events with nonnegative matrix factorization, into which two Bayesian nonparametric priors, the Markov Indian buffet process and the Chinese restaurant process, are incorporated. By assuming a countably infinite number of possible audio events in the input signal, this approach transcribes the events automatically while avoiding the model selection problem. Bayesian logistic regression then annotates the audio frames with multiple event labels in a semi-supervised learning setup. Experimental results show that our model annotates an audio signal better than a baseline method. We also verify that the infinite generative model can detect unknown audio events that are not included in the training data.
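The generative side of such a model can be illustrated with a minimal sketch. It is not the authors' implementation: it truncates the Markov Indian buffet process to a finite number of candidate events `K`, assumes Gamma-distributed bases and gains, omits the Chinese restaurant process and the logistic-regression labeling stage, and uses hypothetical names (`sample_truncated_mibp`, `expected_spectrogram`) and hyperparameter values chosen only for illustration. Each candidate event follows its own two-state Markov chain, so events can overlap in a frame and tend to persist once active.

```python
import random


def sample_truncated_mibp(K, T, alpha=2.0, gamma=5.0, delta=1.0, rng=None):
    """Sample a K x T binary activation matrix from a finite (truncated)
    approximation of the Markov Indian buffet process.

    Assumption: event k switches on with probability a_k ~ Beta(alpha/K, 1)
    and stays on with probability b_k ~ Beta(gamma, delta); every chain
    starts from a dummy "off" state before the first frame.
    """
    rng = rng or random.Random()
    a = [rng.betavariate(alpha / K, 1.0) for _ in range(K)]  # P(off -> on)
    b = [rng.betavariate(gamma, delta) for _ in range(K)]    # P(on -> on)
    Z = [[0] * T for _ in range(K)]
    for k in range(K):
        prev = 0  # dummy initial state: event is off
        for t in range(T):
            p_on = b[k] if prev == 1 else a[k]
            Z[k][t] = 1 if rng.random() < p_on else 0
            prev = Z[k][t]
    return Z


def expected_spectrogram(F, K, T, rng=None):
    """Build an F x T nonnegative spectrogram as W (Z * H): nonnegative
    bases W, nonnegative gains H, and binary event activations Z.

    Assumption: Gamma(1, 1) priors on W and H, chosen only for illustration.
    """
    rng = rng or random.Random()
    Z = sample_truncated_mibp(K, T, rng=rng)
    W = [[rng.gammavariate(1.0, 1.0) for _ in range(K)] for _ in range(F)]
    H = [[rng.gammavariate(1.0, 1.0) for _ in range(T)] for _ in range(K)]
    X = [
        [sum(W[f][k] * Z[k][t] * H[k][t] for k in range(K)) for t in range(T)]
        for f in range(F)
    ]
    return X, Z


if __name__ == "__main__":
    X, Z = expected_spectrogram(F=8, K=5, T=20, rng=random.Random(0))
    for k, row in enumerate(Z):
        print(f"event {k}: " + "".join(str(z) for z in row))
```

Each printed row is the on/off pattern of one candidate event over time. As `K` grows, most rows stay empty because the switch-on probability shrinks with `alpha / K`, which is the behavior that lets the nonparametric limit sidestep choosing the number of events in advance.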