Multitask Learning of Deep Neural Network-Based Keyword Spotting for IoT Devices

Speech-based interfaces are convenient and intuitive and are therefore strongly preferred for human–computer interaction with Internet of Things (IoT) devices. Pre-defined keywords typically serve as a trigger that tells the device to start accepting subsequent voice commands. Keyword spotting techniques used as voice trigger mechanisms typically model the target keyword with triphone models and non-keywords with single-state filler models. Recently, deep neural networks (DNNs) have outperformed hidden Markov models with Gaussian mixture models in various tasks, including speech recognition. However, conventional DNN-based keyword spotting methods cannot easily change the target keywords, which is an essential feature for a speech-based IoT device interface. Additionally, the increased computational requirements hinder the use of complex filler models in DNN-based keyword spotting systems, which diminishes their accuracy. In this paper, we propose a novel DNN-based keyword spotting system that can change the keyword on the fly and uses both triphone and monophone acoustic models to reduce computational complexity and improve generalization performance. Experimental results on the FFMTIMIT corpus show that the proposed method reduces the error rate by 36.6%.
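To make the multitask idea concrete, the sketch below shows one common way to train a single network on triphone and monophone targets at once: shared hidden layers feed two softmax output heads, and the per-frame loss is a weighted sum of the two cross-entropies. The abstract does not state that the proposed system is built this way, so the layer sizes, the ReLU activation, the `mono_weight` value, and all names (`MultitaskNet`, `multitask_loss`) are illustrative assumptions.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def linear(W, b, x):
    """Affine layer: W is a list of rows, b a bias list, x the input list."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init(rng, n_out, n_in):
    """Small random weights and zero biases (hypothetical initialisation)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


class MultitaskNet:
    """One shared hidden layer with separate triphone and monophone heads."""

    def __init__(self, n_in, n_hidden, n_triphone, n_monophone, seed=0):
        rng = random.Random(seed)
        self.shared = init(rng, n_hidden, n_in)
        self.tri_head = init(rng, n_triphone, n_hidden)
        self.mono_head = init(rng, n_monophone, n_hidden)

    def forward(self, x):
        # Shared representation (ReLU is an assumption, not from the paper).
        h = [max(0.0, v) for v in linear(*self.shared, x)]
        p_tri = softmax(linear(*self.tri_head, h))
        p_mono = softmax(linear(*self.mono_head, h))
        return p_tri, p_mono


def multitask_loss(p_tri, p_mono, y_tri, y_mono, mono_weight=0.5):
    """Weighted sum of the two cross-entropy terms for one frame."""
    return -math.log(p_tri[y_tri]) - mono_weight * math.log(p_mono[y_mono])


# Usage: one acoustic feature frame with made-up dimensions and labels.
net = MultitaskNet(n_in=4, n_hidden=8, n_triphone=6, n_monophone=3)
p_tri, p_mono = net.forward([0.1, 0.2, 0.3, 0.4])
loss = multitask_loss(p_tri, p_mono, y_tri=2, y_mono=1)
```

In a setup like this, both heads share the hidden layer, so the monophone task acts as a regulariser for the triphone task without adding a second network.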
