Spatial correlation model based observation vector clustering and MVDR beamforming for meeting recognition

This paper presents a minimum variance distortionless response (MVDR) beamforming-based speech enhancement approach for meeting speech recognition. In meetings, speaker overlaps and noise signals are not negligible. To handle these issues, we employ MVDR beamforming, in which accurate estimation of the steering vector is paramount. We recently found that estimating the steering vector by clustering the time-frequency components of the microphone observation vectors performs well for real-world noise reduction. The clustering is guided by the spatial correlation matrix of each speaker, which is obtained by modeling the time-frequency components of the observation vectors with a complex Gaussian mixture model (CGMM). Experimental results with real recordings show that the proposed MVDR scheme outperforms conventional null-beamformer-based speech enhancement in meeting situations.
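The pipeline sketched in the abstract can be illustrated per frequency bin: estimate a speech and a noise spatial covariance matrix from mask-weighted observation vectors, take the principal eigenvector of the speech covariance as the steering vector, and apply the MVDR solution. The following is a minimal numpy sketch under stated assumptions, not the paper's implementation: the CGMM posterior masks are replaced by a hypothetical oracle speech-activity mask, and all names (`estimate_steering_vector`, `mvdr_weights`, the toy data) are illustrative.

```python
import numpy as np

def estimate_steering_vector(R_speech):
    # Principal eigenvector of the speech spatial covariance matrix.
    # np.linalg.eigh returns eigenvalues in ascending order, so the
    # last column corresponds to the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(R_speech)
    return eigvecs[:, -1]

def mvdr_weights(d, R_noise):
    # Classic MVDR solution: w = R_n^{-1} d / (d^H R_n^{-1} d).
    # By construction w^H d = 1 (distortionless constraint).
    Rn_inv_d = np.linalg.solve(R_noise, d)
    return Rn_inv_d / (d.conj() @ Rn_inv_d)

# Toy data for one frequency bin: M microphones, T frames (illustrative).
rng = np.random.default_rng(0)
M, T = 4, 400
d_true = np.exp(1j * rng.uniform(0, 2 * np.pi, size=M)) / np.sqrt(M)
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
s[T // 2:] = 0  # speech absent in the second half of the frames
n = 0.1 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
y = d_true[:, None] * s[None, :] + n  # observation vectors

# Oracle stand-in for the CGMM posterior mask; in the paper, soft
# time-frequency masks would come from the CGMM clustering instead.
mask = np.abs(s) > 0
R_s = (y[:, mask] @ y[:, mask].conj().T) / mask.sum()
R_n = (y[:, ~mask] @ y[:, ~mask].conj().T) / (~mask).sum()

d_hat = estimate_steering_vector(R_s)
w = mvdr_weights(d_hat, R_n)
enhanced = w.conj() @ y  # beamformer output, one value per frame
```

In practice these covariances are accumulated per frequency bin over the whole utterance, and the mask-weighted estimates make the scheme robust without explicit knowledge of the array geometry, which is why the steering vector can be read off as the principal eigenvector rather than computed from source directions.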
