High Performance Personal Adaptation Speech Recognition Framework by Incremental Learning with Plural Language Models

This paper introduces a speech recognition framework for high-performance personalized adaptation. It is based on plural language models and a personalized incremental-learning interface for error correction. When an error in a recognition result is detected by a bidirectional neural language model, the framework generates a corrected sentence by majority decision among multiple n-gram language models, each considering a different aspect of the text. Moreover, we introduce speaker adaptation by updating the language models through incremental learning, which adjusts their parameters from training data. The experiments show that our framework achieves a 78% improvement in word error rate compared with Google Chrome's Speech Recognition API. Our framework can be used to improve one-to-one human-machine dialogue systems such as intelligent counseling agents.
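The majority decision among multiple n-gram language models can be sketched as follows. This is a minimal illustration, not the authors' implementation: the model names and proposal sentences are hypothetical, and each model is assumed to emit one candidate corrected sentence for a flagged recognition result.

```python
from collections import Counter

def majority_correction(candidates):
    """Pick the sentence proposed by the most language models.

    `candidates` maps a (hypothetical) model name to the corrected
    sentence that model proposes for an erroneous recognition result.
    Ties are broken by first occurrence among the proposals.
    """
    votes = Counter(candidates.values())
    sentence, _ = votes.most_common(1)[0]
    return sentence

# Example: three hypothetical n-gram models vote on a correction.
proposals = {
    "word_3gram": "I want to see a doctor",
    "char_5gram": "I want to see a doctor",
    "pos_3gram":  "I want to sea a doctor",
}
print(majority_correction(proposals))  # → "I want to see a doctor"
```

In practice the voting models would differ in the aspects they consider (e.g. word-level vs. character-level context), so an error that fools one model is likely to be outvoted by the others.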
