Biologically-Inspired Spike-Based Automatic Speech Recognition of Isolated Digits Over a Reproducing Kernel Hilbert Space

This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into an infinite-dimensional function space, i.e., a Reproducing Kernel Hilbert Space (RKHS), using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input by gradient descent. This kernelized recurrent system is highly parsimonious and, when trained discriminatively, achieves the necessary memory depth through feedback of its internal states, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and the internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel affects neither the system nor the learning algorithm. Moreover, we show that this framework can outperform both traditional hidden Markov model (HMM) speech processing and neuromorphic implementations based on spiking neural networks (SNNs), yielding accurate and ultra-low-power word spotters. As a proof of concept, we demonstrate its capabilities on the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR), or keyword spotting. Compared with an HMM using a Mel-frequency cepstral coefficient (MFCC) front end without time derivatives, our MFCC-driven kernel adaptive autoregressive-moving-average system (MFCC-KAARMA) offered improved performance. With a spike-train front end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared with MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regimes.
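The front end described above can be illustrated with a minimal sketch in Python. It is not the authors' implementation: the function names `lif_encode` and `mci_kernel`, all parameter values (time step, time constants, gain, threshold), and the reset-to-zero rule are assumptions chosen for illustration. The kernel shown is the memoryless cross-intensity kernel with Laplacian smoothing, one commonly used point-process kernel; the abstract does not state which kernel is used.

```python
import math


def lif_encode(signal, dt=1e-3, tau=0.02, threshold=1.0, gain=100.0):
    """Encode one channel of samples as spike times (in seconds).

    Hypothetical single-channel leaky integrate-and-fire encoder;
    every default parameter value is an illustrative assumption.
    """
    v = 0.0  # membrane potential
    spikes = []
    for n, x in enumerate(signal):
        # Forward-Euler step of dv/dt = -v / tau + gain * x
        v += dt * (-v / tau + gain * x)
        if v >= threshold:
            spikes.append(n * dt)
            v = 0.0  # assumed reset-to-zero after each spike
    return spikes


def mci_kernel(s, t, tau=0.01):
    """Memoryless cross-intensity kernel between two spike trains.

    Sums a Laplacian kernel exp(-|a - b| / tau) over all pairs of
    spike times, giving an inner product between the two trains.
    """
    return sum(math.exp(-abs(a - b) / tau) for a in s for b in t)


if __name__ == "__main__":
    weak = lif_encode([1.0] * 100)    # constant weak drive
    strong = lif_encode([2.0] * 100)  # stronger drive fires more often
    print(len(weak), len(strong))
    print(mci_kernel(weak, strong))
```

A multi-channel version would apply `lif_encode` to each filter-bank channel and combine the per-channel kernel values, for example by summation. Because the kernel only has to accept two spike trains and return a scalar, it can be swapped without touching the encoder, which mirrors the abstract's remark that the choice of kernel is independent of the system and the learning algorithm.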
