Korean Grapheme Unit-based Speech Recognition Using Attention-CTC Ensemble Network

This study proposes an end-to-end speech recognition method based on an attention-CTC ensemble network that uses Korean graphemes as recognition units. End-to-end speech recognition replaces a pipeline of separate modules, including a DNN-HMM-based acoustic model, an N-gram-based language model, and a WFST-based decoding network, with a single deep neural network. To predict the outputs of the end-to-end model, this study adopts a grapheme-unit output structure. Building the network on graphemes enables effective learning by reducing the number of output labels to be predicted from 11,172 (the full inventory of precomposed Korean syllables) to 49. To this end, the study designs an end-to-end model that combines connectionist temporal classification (CTC), the objective most widely used in end-to-end learning, with an attention-based network. In experiments, the proposed model achieved a 10.5% syllable error rate.
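The reduction from 11,172 syllable labels to a few dozen grapheme labels rests on the fact that every precomposed Hangul syllable decomposes deterministically into an initial consonant, a vowel, and an optional final consonant. The paper's exact 49-unit inventory is not spelled out here, so the sketch below simply illustrates the standard Unicode decomposition arithmetic (19 initial consonants, 21 vowels, 27 optional finals) that makes such a small grapheme alphabet possible; the function name and label set are illustrative, not the authors' code.

```python
# Illustrative sketch: decomposing precomposed Hangul syllables
# (U+AC00..U+D7A3) into grapheme (jamo) units via Unicode arithmetic.
# Each syllable index encodes (initial * 21 + vowel) * 28 + final.

CHOSEONG = [chr(0x1100 + i) for i in range(19)]          # initial consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]         # medial vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]  # optional final consonants

def to_graphemes(text):
    """Split precomposed Hangul syllables into their component jamo."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:
            idx = code - 0xAC00
            lead, rest = divmod(idx, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(CHOSEONG[lead])
            out.append(JUNGSEONG[vowel])
            if tail:  # tail == 0 means no final consonant
                out.append(JONGSEONG[tail])
        else:
            out.append(ch)  # pass non-syllable characters through unchanged
    return out

# "한국" (two syllables) decomposes into six jamo units.
print(to_graphemes("한국"))
```

With a label set of this size plus a handful of special tokens (e.g. blank and space for CTC), the model's softmax layer stays small regardless of how many distinct syllables appear in the training data.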
