Noise Robust End-to-End Speech Recognition for Bangla Language

Robust speech recognition system is crucial for real-world applications and speech signal generally contains considerable amount of noise. We propose an end-to-end deep learning approach leveraging current progresses in Automatic Speech Recognition system to recognize continuous Bangla speech for noisy environments. We improve the robustness of our model through data augmentation and deep model architecture. We evaluate our model on an internal and two available datasets. We achieve impressive result on both clean read and noisy speech data. Our model achieves 10.65 % and 34.83 % CER on CRBLP (clean read) and Babel (noisy) respectively, which is state of the art for continuous Bangla speech recognition to the best of our knowledge. For our internal data set, we achieved 12.31 % and 9.15 % CER on clean and noisy speech respectively.

[1]  A. Algorithms Online and Linear-Time Attention by Enforcing Monotonic Alignments , 2017 .

[2]  Mumit Khan,et al.  Isolated and continuous bangla speech recognition: implementation, performance and application perspective , 2007 .

[3]  Dong Yu,et al.  Conversational Speech Transcription Using Context-Dependent Deep Neural Networks , 2012, ICML.

[4]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[5]  S. Moss Listen , 2017 .

[6]  Philip C. Woodland,et al.  Very deep convolutional neural networks for robust speech recognition , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[7]  Ying Zhang,et al.  Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks , 2016, INTERSPEECH.

[8]  Gerald Penn,et al.  Convolutional Neural Networks for Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[9]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[10]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[11]  Yu Zhang,et al.  Very deep convolutional networks for end-to-end speech recognition , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Wojciech Zaremba,et al.  An Empirical Exploration of Recurrent Network Architectures , 2015, ICML.

[13]  Tara N. Sainath,et al.  A Comparison of Sequence-to-Sequence Models for Speech Recognition , 2017, INTERSPEECH.

[14]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[15]  Mohammad Nuruzzaman Bhuiyan,et al.  Automatic Speech Recognition Technique for Bangla Words , 2013 .

[16]  Ghulam Muhammad,et al.  Automatic speech recognition for Bangla digits , 2009, 2009 12th International Conference on Computers and Information Technology.

[17]  Md. Mijanur Rahman,et al.  Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech Recognition , 2013, ArXiv.

[18]  Zhiheng Huang,et al.  Residual Convolutional CTC Networks for Automatic Speech Recognition , 2017, ArXiv.

[19]  Alexander M. Rush,et al.  Sequence-to-Sequence Learning as Beam-Search Optimization , 2016, EMNLP.

[20]  Jon Barker,et al.  The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[21]  Anup Kumar Paul,et al.  Bangla Speech Recognition System Using LPC and ANN , 2009, 2009 Seventh International Conference on Advances in Pattern Recognition.

[22]  Tara N. Sainath,et al.  Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[23]  Samy Bengio,et al.  An Online Sequence-to-Sequence Model Using Partial Conditioning , 2015, NIPS.

[24]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[25]  Shinji Watanabe,et al.  Joint CTC-attention based end-to-end speech recognition using multi-task learning , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[27]  Santanu Phadikar,et al.  An Ensemble Learning-Based Bangla Phoneme Recognition System Using LPCC-2 Features , 2018 .

[28]  Md Saiful Islam,et al.  Bengali speech recognition: A double layered LSTM-RNN approach , 2017, 2017 20th International Conference of Computer and Information Technology (ICCIT).

[29]  Navdeep Jaitly,et al.  Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition , 2012, INTERSPEECH.

[30]  Firoj Alam,et al.  Development of annotated Bangla speech corpora , 2010, SLTU.

[31]  M. A. H. Akhand,et al.  Acoustic modeling using deep belief network for Bangla speech recognition , 2015, 2015 18th International Conference on Computer and Information Technology (ICCIT).

[32]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[33]  Tara N. Sainath,et al.  Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization , 2012, INTERSPEECH.

[34]  Razvan Pascanu,et al.  On the difficulty of training recurrent neural networks , 2012, ICML.

[35]  Matt Shannon,et al.  Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping , 2017, INTERSPEECH.

[36]  Tara N. Sainath,et al.  State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[37]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[38]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[39]  Gabriel Synnaeve,et al.  Wav2Letter: an End-to-End ConvNet-based Speech Recognition System , 2016, ArXiv.

[40]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[41]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.