Investigation of different acoustic modeling techniques for low resource Indian language data

In this paper, we investigate the performance of deep neural networks (DNNs) and subspace Gaussian mixture models (SGMMs) under low-resource conditions. Although DNNs outperform SGMMs and continuous-density hidden Markov models (CDHMMs) when ample training data are available, their performance degrades when modeling low-resource data. Our experimental results show that SGMMs outperform DNNs when only limited transcribed data are available. To address this problem, we propose training a DNN containing a bottleneck layer in two stages: the first stage extracts bottleneck features, and the second stage uses these extracted bottleneck features to train a DNN with a bottleneck layer. All experiments are performed on two Indian languages (Tamil and Hindi) from the Mandi database. The proposed method improves over the baseline SGMM and DNN models for limited training data.
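A minimal sketch of the proposed two-stage procedure is given below. It is written in PyTorch purely for illustration (the paper's systems are built with Kaldi and PDNN); the layer sizes, feature and target dimensions, and the training loop are assumed placeholders, not the paper's actual configuration.

```python
# Sketch of two-stage bottleneck-DNN training (assumed PyTorch implementation;
# the paper uses Kaldi + PDNN, and all dimensions/epochs here are illustrative).
import torch
import torch.nn as nn

FEAT_DIM = 440      # e.g. spliced acoustic feature frames (hypothetical)
BN_DIM = 40         # bottleneck width (hypothetical)
NUM_PDFS = 2000     # number of tied-state targets (hypothetical)


class BottleneckDNN(nn.Module):
    """DNN with a narrow bottleneck layer before the output layer."""

    def __init__(self, in_dim, hid_dim, bn_dim, out_dim):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.Sigmoid(),
            nn.Linear(hid_dim, hid_dim), nn.Sigmoid(),
            nn.Linear(hid_dim, bn_dim),          # bottleneck layer
        )
        self.back = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(bn_dim, hid_dim), nn.Sigmoid(),
            nn.Linear(hid_dim, out_dim),         # tied-state logits
        )

    def forward(self, x):
        return self.back(self.front(x))

    def bottleneck(self, x):
        """Return bottleneck-layer activations used as features."""
        return self.front(x)


def train(model, feats, targets, epochs=5, lr=1e-3):
    """Plain cross-entropy training loop (illustrative only)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(feats), targets)
        loss.backward()
        opt.step()


# Dummy data standing in for frame-level acoustic features and state labels.
feats = torch.randn(1000, FEAT_DIM)
targets = torch.randint(0, NUM_PDFS, (1000,))

# Stage 1: train a bottleneck DNN on the original acoustic features and
# extract bottleneck features from its bottleneck layer.
stage1 = BottleneckDNN(FEAT_DIM, 1024, BN_DIM, NUM_PDFS)
train(stage1, feats, targets)
with torch.no_grad():
    bn_feats = stage1.bottleneck(feats)

# Stage 2: train a second DNN (again containing a bottleneck layer) on the
# bottleneck features extracted in stage 1.
stage2 = BottleneckDNN(BN_DIM, 1024, BN_DIM, NUM_PDFS)
train(stage2, bn_feats, targets)
```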
