Studies in massively speaker-specific speech recognition

Over the past several years, the speech-recognition research community has focused primarily on speaker-independent recognition, with an emphasis on databases containing ever larger numbers of speakers. For example, the recent DARPA-sponsored EARS program calls for recordings of thousands of speakers. We are instead interested in making a speech interface work well for one particular individual, and we propose using massive amounts of speaker-specific training data recorded in daily life. We call this massively speaker-specific recognition (MSSR). As a preliminary study, we use the large corpus available to us from speech-synthesis work to examine the benefit of MSSR from the acoustic-modeling perspective only. Initial results show that shifting the focus to MSSR reduces word error rates substantially. MSSR also outperforms speaker-adaptive speech recognition systems, since its model parameters can be tuned to suit one particular individual.
