CRNN-CTC Based Mandarin Keywords Spotting

Deep learning based approaches have greatly improved the performance of spoken keyword spotting (KWS). However, KWS systems for different languages need modeling units suited to each language to perform well. In this paper, we propose an end-to-end Mandarin KWS system using a Convolutional Recurrent Neural Network trained with the Connectionist Temporal Classification (CTC) loss function (CRNN-CTC). Tonal syllables are adopted as the modeling units. Experiments on the AISHELL-2 dataset show that the proposed approach achieves a false rejection rate of 5.35% at 0.26 false alarms (FA) per hour on the 13-keyword task and 6.37% at 0.17 FA/hour on the 20-keyword task.
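As an illustration of how a CTC output over tonal syllables could be turned into a keyword decision, the following is a minimal sketch and not the paper's implementation. It assumes greedy (best-path) decoding, which the abstract does not specify; the blank symbol name, the pinyin-with-tone-digit labels, and all function names are hypothetical. The two metric helpers simply restate the usual definitions of false rejection rate and false alarms per hour.

```python
from typing import List, Sequence

BLANK = "<blank>"  # hypothetical name for the CTC blank label


def ctc_greedy_collapse(frame_labels: Sequence[str]) -> List[str]:
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    collapsed: List[str] = []
    previous = None
    for label in frame_labels:
        if label != previous and label != BLANK:
            collapsed.append(label)
        previous = label
    return collapsed


def contains_keyword(syllables: Sequence[str], keyword: Sequence[str]) -> bool:
    """Return True if the keyword's syllables occur as a contiguous run."""
    n = len(keyword)
    return any(
        list(syllables[i:i + n]) == list(keyword)
        for i in range(len(syllables) - n + 1)
    )


def false_rejection_rate(missed: int, keyword_utterances: int) -> float:
    """Fraction of keyword utterances that the system failed to detect."""
    return missed / keyword_utterances


def false_alarms_per_hour(false_alarms: int, audio_hours: float) -> float:
    """Number of spurious detections per hour of evaluated audio."""
    return false_alarms / audio_hours


if __name__ == "__main__":
    # Per-frame argmax labels (illustrative values, not real model output).
    frames = ["<blank>", "ni3", "ni3", "<blank>", "hao3", "hao3",
              "<blank>", "<blank>", "xiao3", "<blank>", "ai4"]
    decoded = ctc_greedy_collapse(frames)
    print(decoded)
    print(contains_keyword(decoded, ["ni3", "hao3"]))
```

Because the labels carry the tone digit, syllables that differ only in tone (for example `ma1` and `ma3`) are distinct units in this sketch, which is the property that tonal-syllable modeling relies on. A blank between two identical labels keeps them as two separate syllables, while an uninterrupted run collapses to one.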
