A Convenient and Extensible Offline Chinese Speech Recognition System Based on Convolutional CTC Networks

Deep learning methods have been widely used in automatic speech recognition (ASR) and have achieved significant improvements in accuracy. Deep CNN structures can significantly improve the performance of HMM-based speech recognition systems. In addition, CNN models exhibit good shift invariance in the time-frequency domain, which makes them more robust to noise. In this paper, we use an acoustic model based on CNN+CTC+Self-Attention, together with a corresponding language model, to construct an end-to-end Chinese speech recognition system as a pre-trained model. On this basis, without retraining the acoustic model, we propose a method combining Levenshtein distance with a hashing method to construct an offline Chinese speech recognition system for a specific scene. The experimental results show that the deep convolutional CTC (Connectionist Temporal Classification) sequence ASR model achieves a word error rate (WER) of 18% on the standard data sets THCHS-30 and Free ST Chinese Mandarin Corpus. In addition, the combination of Levenshtein distance and the hash-based language model achieves an accuracy of more than 90% on specific phrases. The whole system is highly extensible and practical.
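The phrase-matching step described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes the scene-specific phrases live in a hash table mapping phrase text to an action, tries an O(1) exact hash lookup first, and falls back to the nearest phrase by Levenshtein distance when the ASR hypothesis contains recognition errors. The function names, the command table, and the `max_dist` threshold are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match_phrase(hypothesis: str, phrase_table: dict, max_dist: int = 2):
    """Hash lookup for exact hits; otherwise nearest phrase by edit distance."""
    if hypothesis in phrase_table:          # O(1) exact match via the hash table
        return phrase_table[hypothesis]
    best, best_d = None, max_dist + 1
    for phrase, action in phrase_table.items():
        d = levenshtein(hypothesis, phrase)
        if d < best_d:                      # keep the closest phrase so far
            best, best_d = action, d
    return best                             # None if nothing is close enough

# Hypothetical scene-specific command set
commands = {"打开空调": "AC_ON", "关闭空调": "AC_OFF"}
print(match_phrase("打开空调", commands))   # exact hash hit -> AC_ON
print(match_phrase("打开空掉", commands))   # one-character ASR error -> AC_ON
```

The fallback scan is linear in the number of phrases, which is acceptable here because a scene-specific command set is small; the hash table keeps the common error-free case constant time.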
