TimeScaleNet: A Multiresolution Approach for Raw Audio Recognition

In recent years, the use of deep learning techniques in audio signal processing has led the scientific community to develop machine learning strategies that build efficient representations directly from raw waveforms for machine hearing tasks. In the present paper, we show the benefit of a multiresolution approach: TimeScaleNet aims at learning an efficient representation of a sound by learning time dependencies both at the sample level and at the frame level. At the sample level, TimeScaleNet’s architecture introduces a new form of recurrent neural layer that acts as a learnable passband biquadratic digital IIR filterbank and self-adapts to the specific recognition task and dataset, with a large receptive field and very few learnable parameters. The resulting frame-level feature map is then processed by a residual network of depthwise separable atrous convolutions. This second scale of analysis encodes the time fluctuations at the frame timescale, within different learnt pooled frequency bands. In the present paper, TimeScaleNet is evaluated on the Speech Commands Dataset. We report a mean accuracy of 94.87 ± 0.24% (macro-averaged F1-score: 94.9 ± 0.24%) for this task.
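
The sample-level layer described above can be pictured as a bank of band-pass biquad filters whose coefficients are driven by learnable parameters and applied through a second-order recurrence on the raw waveform. The following is a minimal PyTorch sketch of that idea; the RBJ band-pass parameterisation (centre frequency and quality factor per band), the number of bands, the initialisation and the explicit Python loop over time are illustrative assumptions and do not reproduce the exact layer proposed in the paper.

```python
import math

import torch
import torch.nn as nn


class LearnableBiquadBandpass(nn.Module):
    """Sketch of a learnable band-pass biquad IIR filterbank applied
    recurrently (sample by sample) to a raw waveform. The (f0, Q)
    parameterisation mapped to RBJ band-pass coefficients is an
    assumption for illustration, not TimeScaleNet's exact formulation."""

    def __init__(self, n_bands: int = 32, sample_rate: int = 16000):
        super().__init__()
        self.sample_rate = sample_rate
        # Centre frequencies initialised on a log scale, Q initialised to 1.
        f0 = torch.logspace(math.log10(50.0),
                            math.log10(0.45 * sample_rate), n_bands)
        self.f0 = nn.Parameter(f0)
        self.q = nn.Parameter(torch.ones(n_bands))

    def coefficients(self):
        """Map (f0, Q) to normalised biquad coefficients b0..b2 and a1, a2."""
        nyq = 0.5 * self.sample_rate
        w0 = 2.0 * math.pi * self.f0.clamp(20.0, 0.99 * nyq) / self.sample_rate
        alpha = torch.sin(w0) / (2.0 * self.q.clamp(min=0.1))
        a0 = 1.0 + alpha
        b = torch.stack([alpha, torch.zeros_like(alpha), -alpha], dim=-1) / a0.unsqueeze(-1)
        a = torch.stack([-2.0 * torch.cos(w0), 1.0 - alpha], dim=-1) / a0.unsqueeze(-1)
        return b, a  # b: (n_bands, 3), a: (n_bands, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time) raw waveform -> (batch, n_bands, time)
        b, a = self.coefficients()
        batch, T = x.shape
        y = x.new_zeros(batch, b.shape[0], T)
        x1 = x2 = x.new_zeros(batch, 1)            # x[n-1], x[n-2]
        y1 = y2 = x.new_zeros(batch, b.shape[0])   # y[n-1], y[n-2]
        for n in range(T):                         # direct-form I recurrence
            xn = x[:, n:n + 1]
            yn = (b[:, 0] * xn + b[:, 1] * x1 + b[:, 2] * x2
                  - a[:, 0] * y1 - a[:, 1] * y2)
            y[:, :, n] = yn
            x2, x1 = x1, xn
            y2, y1 = y1, yn
        return y
```

Because the filter is recursive, each output sample depends on the whole past of the signal, which is what gives the layer a large receptive field with only two learnable scalars per band in this sketch.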

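At the frame timescale, the abstract describes a residual network of depthwise separable atrous convolutions. The block below is a hedged sketch of one such residual block in PyTorch; the channel count, kernel size, dilation factor and activation are placeholder choices, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn


class SeparableAtrousResBlock(nn.Module):
    """Sketch of one residual block built from a depthwise separable
    atrous (dilated) 1-D convolution, operating on the frame-level
    feature map (batch, channels, frames). Hyper-parameters are
    illustrative assumptions."""

    def __init__(self, channels: int = 64, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation        # keep the frame length
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation,
                                   groups=channels)    # one filter per channel
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pointwise(self.depthwise(x))
        return self.act(x + y)                         # residual connection


if __name__ == "__main__":
    frames = torch.randn(8, 64, 100)                   # dummy frame-level features
    print(SeparableAtrousResBlock()(frames).shape)     # torch.Size([8, 64, 100])
```

Splitting each convolution into a depthwise and a pointwise step keeps the parameter count low, while increasing the dilation from block to block widens the temporal context covered at the frame level.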