Blind change detection for audio segmentation

Automatic segmentation of audio streams according to speaker identity and environmental and channel conditions has become an important preprocessing step for speech recognition, speaker recognition, and audio data mining. In most previous approaches, automatic segmentation was evaluated in terms of the performance of the final system, such as the word error rate of a speech recognition system. In many applications, such as online audio indexing and information retrieval, the actual segment boundaries are required. We present an approach to automatic segmentation based on the cumulative sum (CuSum) algorithm, which minimizes the probability of a missed detection for a given false alarm rate. We compare the CuSum algorithm with the Bayesian information criterion (BIC) algorithm and with a generalization of the Kolmogorov-Smirnov test for automatic segmentation of audio streams. We present a two-step variation of the three algorithms that improves performance significantly. We also present a novel approach that combines hypothesized boundaries from the three algorithms to produce the final segmentation of the audio stream. Our experiments on the 1998 Hub4 broadcast news data show that a variation of the CuSum algorithm significantly outperforms the other two approaches, and that combining the three approaches using a voting scheme improves performance slightly compared with using the two-step variation of the CuSum algorithm alone.
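To make the CuSum idea concrete, the following is a minimal sketch of the textbook one-sided CuSum test for a shift in the mean of a scalar Gaussian signal. It is not the variant evaluated in this work: the abstract does not specify the features, the models, or the two-step procedure, and the blind setting means the distributions before and after a change are not known in advance. The function name `cusum_detect` and the parameters `mu0`, `mu1`, `sigma`, and `threshold` are assumptions made for illustration only.

```python
def cusum_detect(samples, mu0, mu1, sigma, threshold):
    """One-sided CuSum test for a mean shift from mu0 to mu1.

    Assumes (for illustration) Gaussian samples with known, common
    standard deviation sigma. Returns (alarm_index, change_estimate),
    or None if the statistic never exceeds the threshold.
    """
    g = 0.0          # cumulative sum statistic, clipped at zero
    last_reset = 0   # first index after the most recent reset to zero
    for k, x in enumerate(samples):
        # Log-likelihood ratio of N(mu1, sigma^2) against N(mu0, sigma^2).
        llr = (mu1 - mu0) / sigma ** 2 * (x - (mu0 + mu1) / 2.0)
        g += llr
        if g <= 0.0:
            # Evidence favors "no change so far": restart the sum.
            g = 0.0
            last_reset = k + 1
        if g > threshold:
            # Alarm at k; the change is estimated at the last restart.
            return k, last_reset
    return None


# Deterministic toy signal: the mean jumps from 0 to 1 at index 50.
signal = [0.0] * 50 + [1.0] * 50
result = cusum_detect(signal, mu0=0.0, mu1=1.0, sigma=1.0, threshold=5.0)
print(result)
```

Raising `threshold` lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off the abstract refers to when it fixes the false alarm rate and minimizes the probability of a miss.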
