Automatic state discovery for unstructured audio scene classification

In this paper, we present a novel scheme for unstructured audio scene classification that has three desirable properties: autonomy, scalability, and robustness. The scheme builds on our recently introduced machine learning algorithm, Simultaneous Temporal And Contextual Splitting (STACS). STACS discovers an appropriate number of states for the given data and efficiently learns accurate Hidden Markov Model (HMM) parameters. STACS-based algorithms train HMMs up to five times faster than Baum-Welch. They also avoid the overfitting commonly encountered when large state-space HMMs are learned with Expectation Maximization (EM) methods such as Baum-Welch. Finally, they achieve superior classification results on a highly diverse dataset with minimal pre-processing. The scheme has also proven effective in real-world applications: it has been integrated into a commercial surveillance system as an event detection component.
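The abstract does not describe the internals of STACS. The Python sketch below is a rough, hypothetical illustration of two HMM ingredients the abstract relies on: scoring audio data under an HMM, and trading likelihood against the number of states when choosing a model size. It rests on several assumptions that are not taken from the paper. It uses discrete emissions (for example, vector-quantized audio features). It trains one HMM per scene class and classifies by maximum likelihood. It uses a BIC-style penalty (Schwarz's criterion) for model selection. It is not the STACS state-splitting or training procedure. The names `log_likelihood`, `num_free_params`, `bic_score`, and `classify`, and the class labels, are illustrative only.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical discrete-emission HMM representation (pi, A, B):
#   pi[i]   initial probability of state i
#   A[i][j] transition probability from state i to state j
#   B[i][k] probability that state i emits codebook symbol k
HMM = Tuple[List[float], List[List[float]], List[List[float]]]


def log_likelihood(obs: Sequence[int], pi, A, B) -> float:
    """Scaled forward algorithm: log P(obs | HMM) without numerical underflow."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            prev = alpha
            alpha = [
                sum(prev[j] * A[j][i] for j in range(n)) * B[i][obs[t]]
                for i in range(n)
            ]
        scale = sum(alpha)  # P(o_t | o_1..o_{t-1})
        if scale == 0.0:  # observation is impossible under this model
            return float("-inf")
        alpha = [a / scale for a in alpha]  # renormalize, keep log of the scale
        log_p += math.log(scale)
    return log_p


def num_free_params(n_states: int, n_symbols: int) -> int:
    """Free parameters of a discrete HMM (each probability row sums to 1)."""
    return (n_states - 1) + n_states * (n_states - 1) + n_states * (n_symbols - 1)


def bic_score(seqs: List[Sequence[int]], pi, A, B) -> float:
    """BIC-style score (higher is better): log-likelihood minus a size penalty."""
    ll = sum(log_likelihood(s, pi, A, B) for s in seqs)
    n_obs = sum(len(s) for s in seqs)
    k = num_free_params(len(pi), len(B[0]))
    return ll - 0.5 * k * math.log(n_obs)


def classify(obs: Sequence[int], models: Dict[str, HMM]) -> str:
    """Assign obs to the scene class whose HMM gives it the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))


if __name__ == "__main__":
    # Hypothetical one-state class models over a 2-symbol codebook.
    models = {
        "quiet": ([1.0], [[1.0]], [[0.9, 0.1]]),
        "loud": ([1.0], [[1.0]], [[0.1, 0.9]]),
    }
    print(classify([0, 0, 0, 1], models))  # quiet
```

Consider two models that assign the same likelihood to the data. The BIC-style score is lower for the model with more states because its penalty term grows with the parameter count. A state-discovery procedure could use a score of this kind to decide when adding states stops paying off. Whether and how STACS uses such a criterion is not stated in this abstract.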
