Unsupervised speaker indexing using anchor models and automatic transcription of discussions

We present an unsupervised speaker indexing method combined with automatic speech recognition (ASR) for speech archives such as discussions. The indexing method is based on anchor models, in which a feature vector is defined by the similarity to speakers in a large-scale speech database. Several techniques are introduced to improve the discriminative ability of this representation. ASR is then performed using the indexing results. Because no discussion corpus is available for training acoustic and language models, we apply speaker adaptation to the baseline acoustic model based on the indexing, and we construct a language model by merging two models that cover different linguistic features. On real discussion data, we achieved a speaker indexing accuracy of 93% and a significant improvement in ASR performance.
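
The anchor-model idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each anchor speaker is modeled by a single diagonal Gaussian (real systems typically use richer models), and the mean normalization and Euclidean distance are illustrative choices. The names `avg_log_likelihood`, `anchor_vector`, and `euclidean`, and all the toy data, are hypothetical.

```python
import math

def avg_log_likelihood(frames, mean, var):
    """Average per-frame log-likelihood under a diagonal Gaussian (assumed anchor model)."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

def anchor_vector(frames, anchors):
    """Represent an utterance by its similarity to every anchor speaker."""
    scores = [avg_log_likelihood(frames, mean, var) for mean, var in anchors]
    center = sum(scores) / len(scores)
    # Mean normalization is an illustrative choice, not taken from the source.
    return [s - center for s in scores]

def euclidean(a, b):
    """Distance between two utterances in the anchor space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy anchors: (mean, variance) pairs standing in for database speakers.
anchors = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([3.0, 3.0], [1.0, 1.0]),
    ([-3.0, 3.0], [1.0, 1.0]),
]

# Toy utterances: two from one hypothetical speaker, one from another.
utt_a1 = [[0.1, -0.2], [0.3, 0.1]]
utt_a2 = [[-0.1, 0.2], [0.0, 0.1]]
utt_b = [[2.9, 3.1], [3.2, 2.8]]

vec_a1 = anchor_vector(utt_a1, anchors)
vec_a2 = anchor_vector(utt_a2, anchors)
vec_b = anchor_vector(utt_b, anchors)

same_speaker_distance = euclidean(vec_a1, vec_a2)
different_speaker_distance = euclidean(vec_a1, vec_b)
```

In this sketch, utterances from the same speaker land close together in the anchor space, so an unsupervised clustering step over these vectors could assign speaker labels without any enrollment data for the discussion participants.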
