A Comparison and Combination of Unsupervised Blind Source Separation Techniques

Unsupervised blind source separation methods do not require a training phase and thus cannot suffer from a train-test mismatch, which is a common concern in neural-network-based source separation. The unsupervised techniques can be categorized into two classes: those building upon the sparsity of speech in the short-time Fourier transform (STFT) domain, and those exploiting non-Gaussianity or non-stationarity of the source signals. In this contribution, spatial mixture models, which fall into the first category, and independent vector analysis (IVA), as a representative of the second category, are compared with respect to their separation performance and the performance of a downstream speech recognizer on a reverberant dataset of reasonable size. Furthermore, we introduce a serial concatenation of the two, in which the result of the mixture model serves as the initialization of IVA. This combination achieves a significantly better word error rate (WER) than either algorithm individually and even approaches the performance of a much more complex neural-network-based technique.
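To make the serial concatenation more concrete, the following is a minimal NumPy sketch of auxiliary-function-based IVA with iterative-projection updates and a Laplacian source model, in which time-frequency masks from a spatial mixture model seed the first iteration. It is an illustration under stated assumptions, not the authors' implementation: the function name `auxiva`, the argument `init_masks`, the mask layout `(sources, frequencies, frames)`, and the choice to initialize by applying the masks to the first microphone channel are all hypothetical. The sketch covers only the determined case (as many sources as microphones) and omits scale correction of the outputs.

```python
import numpy as np


def auxiva(X, n_iter=20, init_masks=None, eps=1e-10):
    """Sketch of AuxIVA with an optional mask-based initialization.

    X:          STFT of the mixture, complex array of shape (F, T, M).
    init_masks: hypothetical input, real array of shape (M, F, T), e.g.
                class posteriors of a spatial mixture model (assumption).
    Returns the separated signals Y (F, T, M) and demixing matrices W (F, M, M).
    """
    F, T, M = X.shape
    # Rows of W[f] are the demixing vectors w_k^H; start from the identity.
    W = np.tile(np.eye(M, dtype=complex), (F, 1, 1))

    if init_masks is None:
        Y = X.copy()
    else:
        # Assumption: masked reference channel as the initial source estimate.
        Y = np.moveaxis(init_masks, 0, -1) * X[:, :, :1]

    for _ in range(n_iter):
        # Full-band source activity r_k(t), shared across all frequencies.
        r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0))          # (T, M)
        weights = 1.0 / np.maximum(2.0 * r, eps)             # Laplacian model

        for k in range(M):
            # Weighted spatial covariance V_k(f) = mean_t weight * x x^H.
            V = np.einsum('t,ftm,ftn->fmn', weights[:, k], X, X.conj()) / T
            # Iterative projection: w_k = (W V_k)^{-1} e_k, then normalize.
            e_k = np.tile(np.eye(M)[:, k], (F, 1))[..., None]  # (F, M, 1)
            w = np.linalg.solve(W @ V, e_k)[..., 0]            # (F, M)
            scale = np.sqrt(np.einsum('fm,fmn,fn->f', w.conj(), V, w).real)
            w = w / np.maximum(scale, eps)[:, None]
            W[:, k, :] = w.conj()

        # Apply the updated demixing matrices: y(f, t) = W(f) x(f, t).
        Y = np.einsum('fkm,ftm->ftk', W, X)

    return Y, W
```

Calling `auxiva(X)` without masks gives plain IVA started from identity demixing matrices, while `auxiva(X, init_masks=masks)` corresponds to the serial variant in this sketch. Because the masks only determine the first set of weighted covariance matrices, all later iterations are ordinary IVA updates.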
