DA-IICT/IIITV System for Low Resource Speech Recognition Challenge 2018

This paper presents an Automatic Speech Recognition (ASR) system for the Gujarati language, developed for the Low Resource Speech Recognition Challenge for Indian Languages at INTERSPEECH 2018. At the front-end, Amplitude Modulation (AM) features are extracted using standard and data-driven auditory filterbanks. Recurrent Neural Network Language Models (RNNLM) are used for language modeling, giving relative perplexity improvements of 36.18 % and 40.95 % on the test and blind test sets, respectively, over a 3-gram LM. Time Delay Neural Network (TDNN) and TDNN-Long Short-Term Memory (TDNN-LSTM) models are employed for acoustic modeling. The statistical significance of the proposed approaches is assessed using a bootstrap-based % Probability of Improvement (POI) measure. Rescoring the 3-gram LM lattices with the RNNLM gave an absolute reduction of 0.69-1.29 % in Word Error Rate (WER) across the various feature sets. AM features extracted using the gammatone filterbank (AM-GTFB) performed better than the FBANK baseline on the blind test set (POI > 70 %). Combining the ASR systems further improved performance, with absolute WER reductions of 1.89 % and 2.24 % on the test and blind test sets, respectively (100 % POI).
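Since the significance claims above rest on the bootstrap-based % Probability of Improvement measure, the following is a minimal sketch of how such a POI estimate can be computed, assuming per-utterance word error counts for a baseline and a proposed system are already available. The function name, its arguments, and the 10,000-resample default are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a bootstrap-based % Probability of Improvement (POI) estimate.
# Assumes per-utterance word error counts for a baseline and a proposed system,
# aligned utterance by utterance, plus per-utterance reference word counts.
import numpy as np


def probability_of_improvement(base_errors, prop_errors, ref_words,
                               n_boot=10000, seed=0):
    """Estimate the probability (in %) that the proposed system has a lower
    WER than the baseline, by bootstrap resampling over utterances."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base_errors, dtype=float)
    prop = np.asarray(prop_errors, dtype=float)
    words = np.asarray(ref_words, dtype=float)
    n = len(words)

    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample utterances with replacement
        wer_base = base[idx].sum() / words[idx].sum()
        wer_prop = prop[idx].sum() / words[idx].sum()
        wins += wer_prop < wer_base                 # count replicates where the proposed system wins
    return 100.0 * wins / n_boot


# Hypothetical usage with made-up counts for three utterances:
# poi = probability_of_improvement([3, 0, 2], [1, 0, 2], [10, 8, 12])
```

A POI near 100 % means the proposed system beat the baseline in virtually every bootstrap replicate, which is how results such as the 100 % POI for system combination can be read.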
