Learning Filterbanks from Raw Speech for Phone Recognition

We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
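The front-end described above can be sketched in NumPy. This is an illustrative approximation, not the paper's exact initialization: it builds complex Gabor filters with center frequencies spaced on the mel scale (a common way to approximate mel-filterbank responses in the time domain), convolves them with the waveform, and applies squared modulus, moving-average pooling, and log compression. The bandwidth choice and all parameter values here are assumptions for the sketch; in the paper these filters would then be fine-tuned by backpropagation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def gabor_filterbank(n_filters=40, kernel_size=201, sr=16000, f_min=60.0):
    """Complex Gabor filters with mel-spaced center frequencies.

    An illustrative mel-filterbank approximation; bandwidths are set
    proportionally to the local mel spacing (an assumption, not the
    paper's initialization scheme).
    """
    t = np.arange(kernel_size) - kernel_size // 2
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(sr / 2), n_filters)
    freqs = mel_to_hz(mels)
    bandwidths = np.gradient(freqs)  # crude bandwidth ~ filter spacing
    filters = []
    for f0, bw in zip(freqs, bandwidths):
        sigma = sr / (2.0 * np.pi * bw)          # time-domain width from bandwidth
        envelope = np.exp(-0.5 * (t / sigma) ** 2)
        carrier = np.exp(2j * np.pi * f0 * t / sr)
        filters.append(envelope * carrier)
    return np.stack(filters), freqs

def td_filterbank_features(wave, filters, win=400, hop=160):
    """Squared modulus of complex convolutions, averaged and log-compressed."""
    feats = []
    for h in filters:
        power = np.abs(np.convolve(wave, h, mode="same")) ** 2
        # Decimated moving average: a square low-pass window every `hop` samples
        frames = [power[i:i + win].mean()
                  for i in range(0, len(power) - win + 1, hop)]
        feats.append(np.log1p(np.array(frames)))
    return np.stack(feats)  # shape: (n_filters, n_frames)
```

Feeding a pure tone through this front-end produces a feature map whose energy concentrates in the filter whose center frequency is closest to the tone, mimicking the behavior of a mel-filterbank spectrogram.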
