论文信息 - Deep neural network based acoustic model using speaker-class information for short time utterance

Deep neural network based acoustic model using speaker-class information for short time utterance

In speech recognition, it is preferable not to hypothesize the details, e.g., specific age and gender, of a target user. However, speaker independence is one of the things that degrades ASR performance. In this work, we propose a speaker adaptation method to recognize a short time utterance. There have been several studies on speaker-independent DNN-HMM in which i-vector is computed, and the additional information is combined with acoustic features. However, it is difficult to calculate i-vector accurately or apply speaker adaptation (e.g. fMLLR) when the utterance time is short (0.5sec~). In our approach, we calculate the similarity score between the speaker class and the target utterance and utilize speaker class information configured in advance. As a precondition, we restrict the available time period to the first 50 frames per utterance for the recognition of short utterances. In experimental tests, we obtained a 4.0% relative WER gain compared to conventional DNN-HMM.

Seiichi Nakagawa | Hiroshi Seki | Kazumasa Yamamoto

[1] Shuichi Itahashi,et al. JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research , 1999 .

[2] Seiichi Nakagawa,et al. Distant Speech Recognition Using a Microphone Array Network , 2010, IEICE Trans. Inf. Syst..

[3] Michael Picheny,et al. Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[4] Hui Jiang,et al. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5] Olli Viikki,et al. A recursive feature vector normalization approach for robust speech recognition in noise , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[6] Seiichi Nakagawa,et al. Large vocabulary speech recognition system: SPOJUS++ , 2011 .

[7] Thomas Hain,et al. An investigation into speaker informed DNN front-end for LVCSR , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] George Saon,et al. Speaker adaptation of neural network acoustic models using i-vectors , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[9] Seiichi Nakagawa,et al. Soft-clustering technique for training data in Age-and gender-independent speech recognition , 2012, Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference.

[10] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[11] Tsujikawa Misaki,et al. Study on i-vector based speaker identification for short utterances , 2015 .