This paper discusses recent improvements to an earlier developed Assamese spoken query (SQ) system for accessing the prices of agricultural commodities. The SQ system consists of interactive voice response (IVR) and automatic speech recognition (ASR) modules, both developed using open-source resources. The speech data used to develop the ASR system were collected under field conditions and therefore contain a significantly high level of background noise, which severely degraded the recognition performance of the earlier version of the SQ system. To address this, a front-end noise suppression module based on zero frequency filtering has been added to the current version. Furthermore, we have incorporated subspace Gaussian mixture model (SGMM)- and deep neural network (DNN)-based acoustic modeling approaches, which are found to be more effective than the Gaussian mixture model (GMM)-based approach employed in the previous version. The combination of noise removal and DNN-based acoustic modeling yields a relative improvement of almost 32% in word error rate over the earlier reported GMM-HMM-based ASR system. The earlier SQ system was designed expecting user queries in the form of isolated words only; consequently, severely degraded recognition performance was observed whenever queries were spoken as continuous sentences. To overcome this, we present a simple technique that exploits the inherent patterns in user queries and incorporates them into the language model. The modified language model is observed to yield significant improvements in recognition performance for continuous queries.
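The abstract names the zero frequency filtering (ZFF) front end only at a high level. The sketch below illustrates the standard ZFF computation described in the epoch-extraction literature (difference the signal, pass it through a cascade of two 0-Hz resonators, then remove the resulting trend by local-mean subtraction); the function name, the 10 ms trend-removal window, and the use of NumPy/SciPy are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.signal import lfilter

def zero_frequency_filter(speech, fs, trend_window_ms=10.0):
    """Minimal sketch of zero frequency filtering (ZFF) of a speech signal.

    Standard formulation (assumed, not the paper's exact code):
      1. difference the signal to remove any DC offset,
      2. pass it through a cascade of two ideal 0-Hz resonators,
      3. remove the resulting polynomial trend by subtracting a local mean.
    """
    x = np.diff(np.asarray(speech, dtype=np.float64), prepend=0.0)

    # Cascade of two 0-Hz resonators, i.e. 1 / (1 - z^-1)^4.
    y = lfilter([1.0], [1.0, -4.0, 6.0, -4.0, 1.0], x)

    # The resonator output grows polynomially with time, so the slowly
    # varying trend is removed by subtracting a moving average computed
    # over roughly 1-2 pitch periods (~10 ms here, an illustrative choice).
    half = max(1, int(round(trend_window_ms * 1e-3 * fs / 2)))
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    trend = np.convolve(y, kernel, mode="same")
    return y - trend
```

A frame-level energy contour of the returned signal can then be thresholded to retain foreground (near-field) speech and attenuate background regions, which is the kind of front-end behaviour the abstract attributes to the noise suppression module; the exact decision logic used in the paper is not reproduced here.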