A cross-channel modeling approach for automatic segmentation of conversational telephone speech [automatic speech recognition applications]
Automatic segmentation of audio is an essential front-end process for automatic speech recognition applications where the true speech boundaries are unknown. In this paper, we present a cross-channel modeling approach for segmentation in a specific domain: 4-wire recorded conversational telephone speech. The paper describes and compares two types of cross-channel modeling: energy-based and Gaussian mixture model (GMM) based. Since improving speech recognition accuracy is our main objective, the effectiveness of automatic segmentation is measured by word error rate (WER) and compared with a manual-segmentation baseline. With cross-channel modeling, we obtained a negligible WER difference between manual and automatic segmentation in three different languages. We also discuss issues such as training data preparation, features, and language dependency.
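To make the energy-based variant more concrete, the following is a minimal sketch of one way a cross-channel energy comparison could work on a 4-wire recording, where each speaker is on a separate channel. It is not the paper's algorithm: the abstract does not specify features or thresholds, so the function names and the parameters `frame_len`, `floor_db`, and `margin_db` are illustrative assumptions. A frame on one channel is labeled speech only if its energy is above a floor and exceeds the other channel's energy by a margin, which is meant to reject crosstalk leaking in from the other side.

```python
import math


def frame_log_energy(samples, frame_len):
    """Return the log energy in dB of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(power + 1e-10))
    return energies


def cross_channel_speech_frames(target, other, frame_len=80,
                                floor_db=-40.0, margin_db=3.0):
    """Label frames of `target` as speech using both channels.

    Assumed rule (hypothetical, for illustration only): a frame is speech
    when the target energy is above `floor_db` and is at least `margin_db`
    louder than the same frame on the other channel.
    """
    e_target = frame_log_energy(target, frame_len)
    e_other = frame_log_energy(other, frame_len)
    n = min(len(e_target), len(e_other))
    return [
        e_target[i] > floor_db and (e_target[i] - e_other[i]) > margin_db
        for i in range(n)
    ]


def frames_to_segments(labels, frame_sec):
    """Merge runs of consecutive speech frames into (start, end) times."""
    segments = []
    start = None
    for i, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:
        segments.append((start * frame_sec, len(labels) * frame_sec))
    return segments


# Toy two-channel signal: side A talks first, side B talks second.
# Side B carries a low-level copy of A's speech (crosstalk) at the start.
side_a = [0.5] * 160 + [0.0] * 160
side_b = [0.05] * 160 + [0.5] * 160

labels_a = cross_channel_speech_frames(side_a, side_b)
labels_b = cross_channel_speech_frames(side_b, side_a)
segments_a = frames_to_segments(labels_a, frame_sec=0.01)
segments_b = frames_to_segments(labels_b, frame_sec=0.01)
```

In this toy input, a single-channel energy threshold at `floor_db` would accept the crosstalk frames on side B, while the cross-channel comparison rejects them because side A is louder in those frames. A GMM-based variant would replace the fixed thresholds with trained models; that is not sketched here.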