High Performance Personal Adaptation Speech Recognition Framework by Incremental Learning with Plural Language Models

This paper introduces a speech recognition framework for high-performance personalized adaptation. It is based on plural language models and a personalized incremental-learning interface for error correction. When an error in a recognition result is detected by a bidirectional neural language model, the framework generates a corrected sentence by majority decision among multiple n-gram language models, each considering a different aspect of the text. Moreover, we introduce speaker adaptation by updating the language models through incremental learning, which adjusts their parameters from training data. The experiments show that our framework achieves a 78% improvement in word error rate compared with Google Chrome's Speech Recognition API. Our framework can be used to improve one-to-one human-machine dialogue systems such as intelligent counseling agents.
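The majority decision among multiple n-gram language models can be sketched as follows. This is a minimal illustration, not the authors' implementation: the model names and proposal sentences are hypothetical, and each model is assumed to emit one candidate corrected sentence for a flagged recognition result.

```python
from collections import Counter

def majority_correction(candidates):
    """Pick the sentence proposed by the most language models.

    `candidates` maps a (hypothetical) model name to the corrected
    sentence that model proposes for an erroneous recognition result.
    Ties are broken by first occurrence among the proposals.
    """
    votes = Counter(candidates.values())
    sentence, _ = votes.most_common(1)[0]
    return sentence

# Example: three hypothetical n-gram models vote on a correction.
proposals = {
    "word_3gram": "I want to see a doctor",
    "char_5gram": "I want to see a doctor",
    "pos_3gram":  "I want to sea a doctor",
}
print(majority_correction(proposals))  # → "I want to see a doctor"
```

In practice the voting models would differ in the aspects they consider (e.g. word-level vs. character-level context), so an error that fools one model is likely to be outvoted by the others.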
