Korean Grapheme Unit-based Speech Recognition Using Attention-CTC Ensemble Network

This study proposes an end-to-end speech recognition method based on an attention-CTC ensemble network that uses Korean graphemes as recognition units. End-to-end speech recognition replaces a pipeline of separate modules, including a DNN-HMM-based acoustic model, an N-gram-based language model, and a WFST-based decoding network, with a single deep neural network. To predict the outputs of the end-to-end model, this study adopts a grapheme-unit output structure. Building the network on graphemes enables effective learning by reducing the number of output labels to be predicted from 11,172 (the full inventory of precomposed Korean syllables) to 49. To this end, the study designs an end-to-end model that combines connectionist temporal classification (CTC), the objective most widely used in end-to-end learning, with an attention-based network. In experiments, the proposed model achieved a 10.5% syllable error rate.
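The reduction from 11,172 syllable labels to a few dozen grapheme labels rests on the fact that every precomposed Hangul syllable decomposes deterministically into an initial consonant, a vowel, and an optional final consonant. The paper's exact 49-unit inventory is not spelled out here, so the sketch below simply illustrates the standard Unicode decomposition arithmetic (19 initial consonants, 21 vowels, 27 optional finals) that makes such a small grapheme alphabet possible; the function name and label set are illustrative, not the authors' code.

```python
# Illustrative sketch: decomposing precomposed Hangul syllables
# (U+AC00..U+D7A3) into grapheme (jamo) units via Unicode arithmetic.
# Each syllable index encodes (initial * 21 + vowel) * 28 + final.

CHOSEONG = [chr(0x1100 + i) for i in range(19)]          # initial consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]         # medial vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]  # optional final consonants

def to_graphemes(text):
    """Split precomposed Hangul syllables into their component jamo."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:
            idx = code - 0xAC00
            lead, rest = divmod(idx, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(CHOSEONG[lead])
            out.append(JUNGSEONG[vowel])
            if tail:  # tail == 0 means no final consonant
                out.append(JONGSEONG[tail])
        else:
            out.append(ch)  # pass non-syllable characters through unchanged
    return out

# "한국" (two syllables) decomposes into six jamo units.
print(to_graphemes("한국"))
```

With a label set of this size plus a handful of special tokens (e.g. blank and space for CTC), the model's softmax layer stays small regardless of how many distinct syllables appear in the training data.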
