Improvements in IITG Assamese Spoken Query System: Background Noise Suppression and Alternate Acoustic Modeling

In this work, we present recent improvements incorporated into the earlier developed Assamese spoken query (SQ) system for accessing the prices of agricultural commodities. The SQ system consists of interactive voice response (IVR) and automatic speech recognition (ASR) modules developed using open-source resources. The speech data used for training the ASR system has a high level of background noise, since it was collected in field conditions. In the earlier version of the SQ system, this background noise had an adverse effect on recognition performance. In the improved version, a background noise suppression module based on zero frequency filtering is added before feature extraction. In addition, we have explored the recently reported subspace Gaussian mixture model (SGMM) and deep neural network (DNN) based acoustic modeling approaches, which have been reported to be more powerful than the GMM-HMM approach employed in the previous version. Further, the foreground-separated speech data is used while learning the acoustic models for all systems. The combination of noise removal and SGMM/DNN-based acoustic modeling results in a relative improvement of 39% in word error rate over the earlier reported GMM-HMM-based ASR system. The online testing of the developed SQ system, done with the help of real farmers, is also presented in this work. Efforts are also made to quantify the usability of the developed SQ system, and the explored enhancements are found to be helpful on that front as well.
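The abstract does not detail the noise suppression step, but zero frequency filtering is a standard technique from the epoch-extraction literature. The following is a minimal NumPy sketch of that standard formulation (first difference, two cascaded zero-frequency resonators, repeated local-mean trend removal), not the paper's actual implementation; the window length and number of trend-removal passes are illustrative assumptions.

```python
import numpy as np

def zero_frequency_filter(s, fs, win_ms=10.0, trend_passes=3):
    """Zero-frequency filtering (ZFF) of a speech signal.

    Steps: (1) first-difference the signal to remove any DC bias,
    (2) pass it twice through an ideal zero-frequency resonator
    y[n] = x[n] + 2*y[n-1] - y[n-2], and (3) cancel the resulting
    polynomial trend by repeatedly subtracting a local mean computed
    over a short window (roughly one pitch period).
    """
    s = np.asarray(s, dtype=np.float64)
    x = np.diff(s, prepend=s[0])

    # Two cascaded zero-frequency resonators (double pole at z = 1).
    y = x
    for _ in range(2):
        out = np.zeros_like(y)
        for n in range(len(y)):
            y1 = out[n - 1] if n >= 1 else 0.0
            y2 = out[n - 2] if n >= 2 else 0.0
            out[n] = y[n] + 2.0 * y1 - y2
        y = out

    # Trend removal: the resonator output grows polynomially with n,
    # so local-mean subtraction is applied a few times to cancel it.
    w = int(fs * win_ms / 1000.0) | 1  # odd window length
    kernel = np.ones(w) / w
    for _ in range(trend_passes):
        y = y - np.convolve(y, kernel, mode="same")
    return y
```

The energy of the zero-frequency filtered signal is concentrated in voiced (foreground) regions and is comparatively small for background noise, which is what makes it usable for foreground speech segmentation ahead of feature extraction.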
