Online MVDR Beamformer Based on Complex Gaussian Mixture Model With Spatial Prior for Noise Robust ASR

This paper considers acoustic beamforming for noise robust automatic speech recognition. A beamformer attenuates background noise by enhancing sound components coming from a direction specified by a steering vector; accurate steering vector estimation is therefore paramount for successful noise reduction. Recently, time–frequency masking has been proposed for estimating the steering vectors used by a beamformer. In particular, we have developed a new form of this approach that uses a speech spectral model based on a complex Gaussian mixture model (CGMM) to estimate the time–frequency masks needed for steering vector estimation, and we have extended the CGMM-based beamformer to an online speech enhancement scenario. Our previous experiments showed that the proposed CGMM-based approach outperforms a recently proposed mask estimator based on a Watson mixture model as well as the baseline speech enhancement system of the CHiME-3 challenge. This paper provides additional experimental results for our online processing, which achieves performance comparable to that of batch processing given a suitable block-batch size; the online version reduces the CHiME-3 word error rate (WER) on the evaluation set from 8.37% to 8.06%. Moreover, in this paper, we introduce a probabilistic prior distribution for a spatial correlation matrix (a CGMM parameter), which enables more stable steering vector estimation in the presence of interfering speakers. In practice, the performance of the proposed online beamformer degrades on observations that contain only noise and/or interference, because the CGMM parameter estimation fails in such segments. The introduced spatial prior prevents the target speaker's parameters from overfitting to noise and/or interference. Experimental results show that, in a conversation recognition task, the spatial prior reduces the WER from 38.4% to 29.2% compared with the CGMM-based approach without the prior, and that it outperforms a conventional online speech enhancement approach.
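To make the pipeline the abstract describes concrete (time–frequency masks → spatial correlation matrices → steering vector → MVDR filter), the following is a minimal NumPy sketch for a single frequency bin. It is illustrative only, not the authors' implementation: the function name and interface are assumptions, the masks are taken as given (in the paper they come from CGMM posteriors), and the steering vector is extracted as the principal eigenvector of the mask-weighted speech covariance, a standard choice in mask-based MVDR beamforming.

```python
import numpy as np

def mvdr_from_masks(Y, speech_mask, noise_mask, eps=1e-6):
    """Mask-based MVDR beamforming for one frequency bin (hypothetical helper).

    Y           : (M, T) complex STFT observations (M mics, T frames)
    speech_mask : (T,) mask for the target speech
    noise_mask  : (T,) mask for the noise
    Returns the (T,) enhanced STFT frames.
    """
    M, _ = Y.shape

    # Mask-weighted spatial correlation matrices:
    #   R = sum_t mask[t] * y_t y_t^H / sum_t mask[t]
    R_s = (speech_mask * Y) @ Y.conj().T / (speech_mask.sum() + eps)
    R_n = (noise_mask * Y) @ Y.conj().T / (noise_mask.sum() + eps)
    R_n += eps * np.eye(M)  # diagonal loading for numerical stability

    # Steering vector: principal eigenvector of the speech covariance
    # (np.linalg.eigh returns eigenvalues in ascending order)
    _, eigvecs = np.linalg.eigh(R_s)
    h = eigvecs[:, -1]

    # MVDR filter: w = R_n^{-1} h / (h^H R_n^{-1} h)
    Rn_inv_h = np.linalg.solve(R_n, h)
    w = Rn_inv_h / (h.conj() @ Rn_inv_h)

    return w.conj() @ Y  # beamformer output w^H y_t for all frames

# Toy usage: 6 microphones, 100 frames at one frequency bin
rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 100)) + 1j * rng.standard_normal((6, 100))
mask = rng.uniform(size=100)
enhanced = mvdr_from_masks(Y, mask, 1.0 - mask)
```

Two refinements from the paper are deliberately omitted here: in the online variant, the spatial correlation matrices would be updated block by block rather than computed from all frames at once, and the proposed spatial prior would regularize the target speaker's covariance estimate so that it does not overfit to noise-only or interference-only observations.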
