Bangla Speech Recognition for Voice Search

In this work, several Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) based and Deep Neural Network-Hidden Markov Model (DNN-HMM) based models are analyzed for speech recognition in the Bangla language, with the goal of building a voice search module for the search engine Pipilika. A small corpus of 9 hours of speech recordings from 49 different speakers was prepared for this work, covering a vocabulary of 500 unique words. The lowest Word Error Rate (WER) was 3.96% for the GMM-HMM based model and 5.30% for the DNN-HMM based model. To the best of our knowledge, this is the lowest WER reported for Bangla speech recognition at this vocabulary size.
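For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognized transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, assuming whitespace tokenization. The function name `word_error_rate` and the example sentences are illustrative and are not taken from this work, which may have used a toolkit's own scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative helper (hypothetical name); assumes whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference
    # prefix processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words gives a WER of 0.25.
wer = word_error_rate("open the search page", "open a search page")
```

Because insertions are counted in the numerator while the denominator is the reference length, WER can exceed 100% when the hypothesis contains many spurious words.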
