Studying the Effects of Feature Extraction Settings on the Accuracy and Memory Requirements of Neural Networks for Keyword Spotting

Due to the always-on nature of keyword spotting (KWS) systems, low-power microcontroller units (MCUs) are the preferred deployment devices. However, the limited compute power and memory budget of MCUs can make it difficult to meet accuracy requirements. Although many studies have designed neural networks with a small memory footprint to address this problem, the effects of different feature extraction settings are rarely studied. This work addresses that question by first comparing six of the most popular and state-of-the-art neural network architectures for KWS on the Google Speech Commands dataset. Then, keeping the network architectures unchanged, it performs a comprehensive investigation of how different frequency transformation settings, such as the number of mel-frequency cepstral coefficients (MFCCs) and the length of the stride window, affect the accuracy and memory footprint (RAM/ROM) of the models. The results show that different preprocessing settings can change the accuracy and RAM/ROM requirements of the models significantly. Furthermore, the DS-CNN outperforms the other architectures in terms of accuracy, reaching 93.47% with the lowest ROM requirements, while the GRU achieves an accuracy of 91.02% with the smallest RAM requirements of all networks.
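The link between feature extraction settings and memory footprint can be illustrated with a short sketch: the number of MFCCs and the stride length jointly set the size of the input feature map, which in turn drives the activation RAM and, for dense input layers, part of the weight ROM. The Python snippet below (using librosa) shows this scaling; the 1-second 16 kHz clip length matches the Speech Commands dataset, but the window and stride values are illustrative assumptions, not the exact settings studied in this work.

# Minimal sketch: how the MFCC count and stride length determine the input
# feature-map size of a KWS model. Window/stride values are illustrative
# assumptions, not the paper's settings. Requires librosa.
import numpy as np
import librosa

SR = 16000                                # Speech Commands clips: 1 s at 16 kHz
signal = np.zeros(SR, dtype=np.float32)   # placeholder 1-second clip

def mfcc_features(signal, n_mfcc=10, win_ms=40, stride_ms=20):
    """Return an (n_mfcc, n_frames) MFCC matrix for one clip."""
    win = int(SR * win_ms / 1000)
    hop = int(SR * stride_ms / 1000)
    return librosa.feature.mfcc(y=signal, sr=SR, n_mfcc=n_mfcc,
                                n_fft=win, win_length=win, hop_length=hop)

# Fewer coefficients or a longer stride shrink the input tensor, reducing the
# activation memory the MCU has to hold at inference time.
for n_mfcc, stride_ms in [(10, 20), (10, 40), (40, 20)]:
    feats = mfcc_features(signal, n_mfcc=n_mfcc, stride_ms=stride_ms)
    print(f"n_mfcc={n_mfcc:2d}, stride={stride_ms:2d} ms -> "
          f"input shape {feats.shape}, {feats.size} values")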
