Segregation of Speakers for Speaker Adaptation in TV News Audio

Speaker adaptation is commonly used to compensate speaker variation in large vocabulary continuous speech recognition. In a multi-speaker environment where speakers change frequently speaker segregation is needed to divide the input audio stream to speaker turns. Speaker turns define the current speaker at each time and speaker adaptation can thus be done based on speaker turns. The novelty of this paper is that the speaker-specific transformations are estimated incrementally and in tandem with speaker segregation. Therefore we need a transformation that can be reliably estimated based on one speaker turn alone. We propose the constrained maximum likelihood linear regression (CMLLR) for this. In testing with Finnish TV news audio, speaker adaptation reduced the average letter error rate 25% relative to baseline.

[1]  Daben Liu,et al.  Fast speaker change detection for broadcast news transcription and indexing , 1999, EUROSPEECH.

[2]  Vesa Siivola,et al.  Growing an n-gram language model , 2005, INTERSPEECH.

[3]  Mikko Kurimo,et al.  Duration modeling techniques for continuous speech recognition , 2004, INTERSPEECH.

[4]  Zhipeng Zhang,et al.  On-line incremental speaker adaptation with automatic speaker change detection , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[5]  Amit Srivastava,et al.  Online speaker adaptation and tracking for real-time speech recognition , 2005, INTERSPEECH.

[6]  Herbert Gish,et al.  Segregation of speakers for speech recognition and speaker identification , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[7]  Krzysztof Marasek,et al.  SPEECON – Speech Databases for Consumer Devices: Database Specification and Validation , 2002, LREC.

[8]  Lie Lu,et al.  Speaker change detection and tracking in real-time news broadcasting analysis , 2002, MULTIMEDIA '02.

[9]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[10]  Janne Pylkkönen AN EFFICIENT ONE-PASS DECODER FOR FINNISH LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION , .

[11]  Hervé Bourlard,et al.  Unknown-multiple speaker clustering using HMM , 2002, INTERSPEECH.

[12]  H. Gish,et al.  Text-independent speaker identification , 1994, IEEE Signal Processing Magazine.

[13]  Teemu Hirsimäki,et al.  On Growing and Pruning Kneser–Ney Smoothed $ N$-Gram Models , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[15]  Mikko Kurimo,et al.  Unlimited vocabulary speech recognition with morph language models applied to Finnish , 2006, Comput. Speech Lang..