A cross-channel modeling approach for automatic segmentation of conversational telephone speech [automatic speech recognition applications]

Automatic segmentation of audio is an essential front-end process for automatic speech recognition applications in which the true speech boundaries are unknown. In this paper, we present a cross-channel modeling approach to segmentation in a specific domain: 4-wire recorded conversational telephone speech. The paper describes and compares two types of cross-channel modeling, one energy-based and one based on Gaussian mixture models (GMMs). Because improving speech recognition accuracy is our main objective, the effectiveness of automatic segmentation is measured by word error rate (WER) and compared with a manual-segmentation baseline. With cross-channel modeling, we obtained a negligible WER difference between manual and automatic segmentation on three different languages. Issues such as training data preparation, features, and language dependency are also discussed.
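Because WER is the evaluation metric throughout, a minimal sketch of its standard definition may be useful: the word-level edit distance (substitutions, deletions, and insertions) between the reference and hypothesis transcripts, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring tool used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace; no text normalization
    (case folding, punctuation removal) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, comparing segmentations amounts to decoding the same audio once with manual segment boundaries and once with automatic ones, scoring both outputs against the same reference transcript, and taking the difference between the two WER values.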
