Deep Variational Filter Learning Models for Speech Recognition

We present a novel approach to deriving robust speech representations for automatic speech recognition (ASR) systems. The proposed method is an unsupervised, data-driven modulation filter learning approach that preserves the key modulations of the speech signal in the spectro-temporal domain. This is achieved with a deep generative modeling framework that learns modulation filters using a convolutional variational autoencoder (CVAE). A skip-connection-based CVAE enables the learning of multiple non-redundant modulation filters in the temporal and spectral modulation domains, using the temporal and spectral trajectories of the input spectrograms. The learnt filters are then used to process the spectrogram features for ASR training. The ASR experiments are performed on the Aurora-4 (additive noise with channel artifacts) and CHiME-3 (additive noise with reverberation) databases. The results show significant improvements for the proposed CVAE model over the baseline features as well as over other robust front-ends (average relative improvements in word error rate over the baseline features of 9% on the Aurora-4 database and 23% on the CHiME-3 database). In addition, the proposed features are highly beneficial for semi-supervised training of ASR when reduced amounts of labeled training data are available (average relative improvement of 29% over the baseline features on the Aurora-4 database with 30% of the labeled training data).
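As a rough illustration of the final filtering step only, the sketch below applies a 1-D temporal modulation filter along the time trajectory of each frequency band and a 1-D spectral modulation filter along the frequency trajectory of each frame. It is not the authors' implementation: the function name `apply_modulation_filter`, the spectrogram size, and the Hann-window kernels (placeholders standing in for filters that a CVAE would learn) are all assumptions made for illustration, as is applying the two filters one after the other.

```python
import numpy as np


def apply_modulation_filter(spec, kernel, axis):
    """Convolve every 1-D trajectory of `spec` along `axis` with `kernel`.

    mode="same" keeps the trajectory length, so the output has the
    same shape as the input spectrogram.
    """
    return np.apply_along_axis(
        lambda traj: np.convolve(traj, kernel, mode="same"), axis, spec
    )


# Hypothetical input: 40 frequency bands by 200 frames of random values
# standing in for a log-mel spectrogram.
rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 200))

# Placeholder kernels (assumption): normalized Hann windows used in place
# of learnt temporal and spectral modulation filters.
temporal_kernel = np.hanning(17)
temporal_kernel /= temporal_kernel.sum()
spectral_kernel = np.hanning(5)
spectral_kernel /= spectral_kernel.sum()

# Filter the temporal trajectories (axis=1), then the spectral ones (axis=0).
filtered = apply_modulation_filter(spec, temporal_kernel, axis=1)
filtered = apply_modulation_filter(filtered, spectral_kernel, axis=0)
```

In this sketch the filtered spectrogram keeps the shape of the input, so it could be passed to an acoustic model in place of the unfiltered features.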
