Speech Enhancement Based on Bayesian Low-Rank and Sparse Decomposition of Multichannel Magnitude Spectrograms

This paper presents a blind multichannel speech enhancement method that can deal with a time-varying layout of microphones and sound sources. Since nonnegative tensor factorization (NTF) separates a multichannel magnitude (or power) spectrogram into source spectrograms without using phase information, it is robust against time-varying mixing systems. NTF, however, requires prior information, such as the spectral bases (templates) of each source spectrogram, in advance. To solve this problem, we develop a Bayesian model called Bayesian robust NTF (Bayesian RNTF) that decomposes a multichannel magnitude spectrogram into target speech and noise spectrograms based on the sparseness of speech and the low rankness of noise. Bayesian RNTF is applied to the challenging task of speech enhancement for a microphone array distributed on a hose-shaped rescue robot. When the robot searches for victims under collapsed buildings, the layout of the microphones changes over time, and some of the microphones often fail to capture the target speech. Our method works robustly in such situations because, operating on magnitude spectrograms without phase information, it is robust against the time-varying mixing system. Experiments using a 3-m hose-shaped rescue robot with eight microphones show that the proposed method outperforms conventional blind methods in enhancement performance by 1.03 dB in signal-to-noise ratio (SNR).
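To make the low-rank-plus-sparse idea concrete, here is a minimal sketch in Python (NumPy) of a simplified, single-channel, non-Bayesian robust NMF. It is not Bayesian RNTF: it omits the multichannel structure and the priors, and instead minimizes the KL divergence between a magnitude spectrogram X and the model W @ H + S (low-rank noise plus sparse speech), plus an L1 penalty lam on S, using multiplicative updates. The functions `robust_nmf`, `enhance`, and `objective` and the parameters `n_bases`, `lam`, and `n_iter` are hypothetical names chosen for illustration.

```python
import numpy as np


def objective(X, W, H, S, lam, eps=1e-12):
    """KL divergence D(X | W H + S) plus lam * sum(S) (hypothetical helper)."""
    Y = W @ H + S + eps
    return float(np.sum(X * np.log((X + eps) / Y) - X + Y) + lam * np.sum(S))


def robust_nmf(X, n_bases=4, lam=0.1, n_iter=200, seed=0, eps=1e-12):
    """Split a nonnegative magnitude spectrogram X (freq x time) into a
    low-rank part W @ H (noise) and a sparse part S (speech).

    Simplified single-channel, non-Bayesian sketch; each multiplicative
    update does not increase objective(X, W, H, S, lam).
    """
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, n_bases)) + eps
    H = rng.random((n_bases, T)) + eps
    S = rng.random((F, T)) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        # Low-rank noise bases W, with H and S held fixed.
        R = X / (W @ H + S + eps)
        W *= (R @ H.T) / (ones @ H.T + eps)
        # Low-rank activations H, with W and S held fixed.
        R = X / (W @ H + S + eps)
        H *= (W.T @ R) / (W.T @ ones + eps)
        # Sparse part S: the L1 penalty lam shrinks every entry.
        R = X / (W @ H + S + eps)
        S *= R / (1.0 + lam)
    return W, H, S


def enhance(X, W, H, S, eps=1e-12):
    """Wiener-like soft mask: share of each bin attributed to the sparse part."""
    mask = S / (W @ H + S + eps)
    return mask * X


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    noise = rng.random((64, 3)) @ rng.random((3, 100))     # rank-3 "noise"
    speech = (rng.random((64, 100)) < 0.05) * 5.0          # sparse "speech"
    X = noise + speech
    W, H, S = robust_nmf(X, n_bases=3, lam=0.1, n_iter=200)
    speech_hat = enhance(X, W, H, S)
    print(objective(X, W, H, S, 0.1))
```

The L1 weight `lam` trades off how much energy is pushed into the sparse (speech) part versus the low-rank (noise) part; a Bayesian treatment such as the one in the abstract would instead infer this balance through priors rather than a hand-tuned constant.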
