Unsupervised Audio Segmentation using Extended Baum-Welch Transformations

Audio segmentation has applications in a variety of contexts, including audio information retrieval, automatic sound analysis, and pre-processing for speech recognition. Extended Baum-Welch (EBW) transformations are most commonly used as a discriminative technique for estimating the parameters of Gaussian mixtures. In this paper, we derive an unsupervised audio segmentation approach using these transformations. We find that our algorithm outperforms both the Bayesian information criterion (BIC) and cumulative sum (CUSUM) segmentation methods. In particular, our EBW segmentation algorithm improves on the baseline approaches in detecting landmarks of short duration and in reducing landmark oversegmentation. In addition, we show that the EBW approach is computationally faster than the baseline methods.
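
For orientation, the BIC baseline named above is commonly formulated as a penalized likelihood-ratio test: a window of feature frames is modeled either by one full-covariance Gaussian or by two Gaussians split at a candidate boundary. The following is a minimal sketch of that baseline under this assumption, not the EBW method derived in this paper. The function names `delta_bic` and `detect_change`, the `margin` and `penalty_weight` parameters, and the small covariance ridge are illustrative choices rather than details taken from the source.

```python
import numpy as np


def delta_bic(x, i, penalty_weight=1.0):
    """Delta-BIC for a hypothesized change after frame i in window x (N x d).

    Positive values favor the two-segment model (a change at i).
    """
    n, d = x.shape

    def logdet(seg):
        # Maximum-likelihood covariance; the ridge (an assumption here)
        # keeps the matrix well-conditioned on short segments.
        cov = np.atleast_2d(np.cov(seg, rowvar=False, bias=True))
        cov = cov + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    # Extra free parameters of the two-Gaussian model:
    # one mean vector and one full covariance matrix.
    n_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * n_params * np.log(n)

    gain = 0.5 * (n * logdet(x)
                  - i * logdet(x[:i])
                  - (n - i) * logdet(x[i:]))
    return gain - penalty


def detect_change(x, margin=10, penalty_weight=1.0):
    """Return (index, score) of the best boundary, or (None, score) if none."""
    n = x.shape[0]
    # Skip candidates too close to the edges so each segment
    # has enough frames for a covariance estimate.
    candidates = list(range(margin, n - margin))
    scores = [delta_bic(x, i, penalty_weight) for i in candidates]
    best = int(np.argmax(scores))
    if scores[best] > 0:
        return candidates[best], scores[best]
    return None, scores[best]
```

Calling `detect_change(frames)` on an `(N, d)` array of feature vectors returns the candidate boundary with the largest positive Delta-BIC, or `None` when no candidate beats the penalty. Scanning every candidate with fresh covariance estimates is the simplest form of the test and also the most expensive, which is relevant to the computation-time comparison the abstract draws.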
