Unsupervised Filterbank Learning for Speech-based Access System for Agricultural Commodity

This paper presents an automatic speech recognition (ASR) system developed as part of a speech-based access system for agricultural commodity information in the Gujarati language. The speech database was collected from farmers in villages of Gujarat state (India) and covers several dialectal variations and real, noisy acoustic environments. We use the recently proposed Convolutional Restricted Boltzmann Machine (ConvRBM) to learn the filterbank used as the front-end. A self-taught learning framework is applied to train the ConvRBM on an additional Gujarati speech database from outside the agricultural commodity domain. The stochastic data sweeping technique is used to speed up ConvRBM training. Experiments using time delay neural networks (TDNNs) show that ConvRBM features give a relative improvement of 5.5% in word error rate (WER) compared to Mel filterbank features. The system-level combination of both feature sets further improves performance (3.55% absolute reduction in WER).
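The results above mix two kinds of WER change: a relative improvement (5.5%) and an absolute reduction (3.55%). The following is a minimal sketch of how these quantities are conventionally computed, assuming the standard word-level edit distance definition of WER. The function names and the example strings and values are hypothetical and are not taken from the paper's experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in percentage points."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Reduction expressed as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical word sequences, for illustration only.
    print(word_error_rate("a b c d", "a x c"))
    # Hypothetical WER values, not results from the paper.
    print(absolute_reduction(20.0, 18.0), relative_reduction(20.0, 18.0))
```

Because a relative figure depends on the baseline, the same absolute reduction corresponds to a larger relative improvement when the baseline WER is lower, which is why the two numbers in the abstract are not directly comparable.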
