Event-driven Pipeline for Low-latency Low-compute Keyword Spotting and Speaker Verification System

This work presents an event-driven acoustic sensor processing pipeline to power a low-resource voice-activated smart assistant. The pipeline includes four major steps: localization, source separation, keyword spotting (KWS), and speaker verification (SV). It is driven by a front-end binaural spiking silicon cochlea sensor. The timing information carried by the output spikes of the cochlea provides spatial cues for localization and source separation. Spike features are generated with low latency from the separated source spikes and are used by both KWS and SV, which rely on state-of-the-art deep recurrent neural network architectures with a small memory footprint. Evaluation on a self-recorded event dataset based on TIDIGITS shows accuracies of over 93% on KWS and over 88% on SV, with a minimum system latency of 5 ms on a resource-limited device.
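To illustrate how spike timing can carry a spatial cue, the sketch below estimates an interaural time difference (ITD) from the spike timestamps of a left and a right cochlea channel by histogramming pairwise time differences. This is a generic illustration and not the localization method used in this work; the function name `estimate_itd_us`, the integer-microsecond timestamp format, and the `max_itd_us` and `bin_us` values are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def estimate_itd_us(left_spikes, right_spikes, max_itd_us=800, bin_us=50):
    """Return the most frequent left-minus-right spike time difference.

    Hypothetical helper for illustration only. Both inputs are sorted lists
    of integer spike timestamps in microseconds. A negative result means the
    left channel fired first. Returns None if no spike pair lies within
    max_itd_us of each other.
    """
    histogram = Counter()
    for t_left in left_spikes:
        # Only right-channel spikes within +/- max_itd_us can form a pair.
        lo = bisect_left(right_spikes, t_left - max_itd_us)
        hi = bisect_right(right_spikes, t_left + max_itd_us)
        for t_right in right_spikes[lo:hi]:
            # Quantize the time difference into bins of width bin_us.
            histogram[round((t_left - t_right) / bin_us)] += 1
    if not histogram:
        return None
    # The bin with the most pairs is taken as the ITD estimate.
    best_bin, _ = max(histogram.items(), key=lambda item: item[1])
    return best_bin * bin_us


# Synthetic example: the right channel lags the left by 200 microseconds.
left = list(range(0, 500_000, 10_000))
right = [t + 200 for t in left]
itd = estimate_itd_us(left, right)
```

In this synthetic case every left spike has exactly one right spike within the search window, so all pairs fall into the same histogram bin. Real cochlea output is noisier, which is one reason a probabilistic treatment of the spike timing is attractive.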
