Spectral Distortion Model for Training Phase-Sensitive Deep-Neural Networks for Far-Field Speech Recognition

In this paper, we present an algorithm that introduces phase perturbation to the training database when training phase-sensitive deep neural-network models. Traditional features such as log-mel or cepstral features do not contain any phase-relevant information. However, features such as raw-waveform or complex-spectra features do contain phase-relevant information. Phase-sensitive features have the advantage of being able to detect differences in time of arrival across different microphone channels or frequency bands. However, compared to magnitude-based features, phase information is more sensitive to various kinds of distortion, such as variations in microphone characteristics, reverberation, and so on. For traditional magnitude-based features, it is widely known that adding noise or reverberation to the training data, an approach often called Multi-style TRaining (MTR), improves robustness. In a similar spirit, we propose an algorithm that introduces spectral distortion to make deep-learning models more robust to phase distortion. We call this approach Spectral-Distortion TRaining (SDTR). In our experiments on a training set consisting of 22 million utterances with and without MTR, this approach reduces Word Error Rates (WERs) relatively by 3.2% and 8.48%, respectively, on test sets recorded on Google Home.
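The abstract does not spell out the perturbation model, so the following is only a minimal sketch of the general idea: randomly rotating the phase of each time-frequency bin of a complex STFT while leaving magnitudes untouched, here assuming a zero-mean von Mises (circular-normal) phase offset. The function name `apply_phase_perturbation` and the concentration parameter `kappa` are hypothetical, not taken from the paper.

```python
import numpy as np

def apply_phase_perturbation(stft, kappa=5.0, rng=None):
    """Illustrative (hypothetical) phase perturbation for SDTR-style training.

    stft  : complex ndarray of shape (frames, freq_bins)
    kappa : concentration of the von Mises distribution; a smaller kappa
            yields larger random phase offsets (assumed parameterization).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw one random phase offset per time-frequency bin, centered at zero.
    phase_noise = rng.vonmises(mu=0.0, kappa=kappa, size=stft.shape)
    # Rotate the phase of each bin; magnitudes are left unchanged.
    return stft * np.exp(1j * phase_noise)

# Usage sketch: perturb a dummy single-channel STFT before feature extraction.
dummy_stft = np.exp(1j * np.random.uniform(-np.pi, np.pi, size=(100, 257)))
perturbed = apply_phase_perturbation(dummy_stft, kappa=5.0)
```

Applied independently to each training utterance (or channel), such a perturbation plays the same role for phase-sensitive features that added noise and reverberation play for magnitude-based features in MTR: the model sees phase variability at training time and becomes less sensitive to it at test time.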
