SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition

We present SMS-WSJ (Spatialized Multi-Speaker Wall Street Journal), a multi-channel database of overlapping speech for training, evaluation, and detailed analysis of source separation and extraction algorithms. It consists of artificially mixed speech taken from the WSJ database. Unlike earlier databases, it draws on all WSJ0+1 utterances and strictly separates the speaker sets used in the training, validation, and test sets. When spatializing the data, we ensure a high degree of randomness with respect to room size, array center and rotation, and speaker position. Furthermore, this paper offers a critical assessment of recently proposed measures of source separation performance. Alongside the code to generate the database, we provide a source separation baseline and a Kaldi recipe with competitive word error rates to establish common ground for evaluation.
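
As an illustration of the randomized spatialization described above, the following minimal sketch draws one scenario with a random room size, array center, array rotation, and speaker positions. It is not the database generation code: the function name `sample_scenario`, the circular six-microphone array, and every numeric range are assumptions chosen only for this example.

```python
import math
import random


def sample_scenario(num_speakers=2, num_mics=6, array_radius=0.1, seed=None):
    """Draw one random room/array/speaker geometry (all ranges are assumed)."""
    rng = random.Random(seed)

    # Room size in metres: length, width, height (assumed ranges).
    room = [rng.uniform(6.0, 10.0), rng.uniform(5.0, 9.0), rng.uniform(2.5, 3.5)]

    # Array center: near the middle of the room, with a small random offset.
    center = [
        room[0] / 2 + rng.uniform(-0.2, 0.2),
        room[1] / 2 + rng.uniform(-0.2, 0.2),
        rng.uniform(1.0, 2.0),
    ]

    # Random rotation of the (assumed circular) array around the vertical axis.
    rotation = rng.uniform(0.0, 2.0 * math.pi)
    mics = []
    for m in range(num_mics):
        angle = rotation + 2.0 * math.pi * m / num_mics
        mics.append([
            center[0] + array_radius * math.cos(angle),
            center[1] + array_radius * math.sin(angle),
            center[2],
        ])

    # Speakers: random azimuth and distance relative to the array center.
    speakers = []
    for _ in range(num_speakers):
        azimuth = rng.uniform(0.0, 2.0 * math.pi)
        distance = rng.uniform(1.0, 2.0)
        speakers.append([
            center[0] + distance * math.cos(azimuth),
            center[1] + distance * math.sin(azimuth),
            center[2] + rng.uniform(-0.2, 0.2),
        ])

    return {"room": room, "center": center, "rotation": rotation,
            "mics": mics, "speakers": speakers}


if __name__ == "__main__":
    scenario = sample_scenario(seed=0)
    print(scenario["room"])
    print(scenario["speakers"])
```

A geometry of this kind would typically be passed to a room impulse response simulator to obtain the multi-channel signals; fixing the seed per mixture keeps the generated data reproducible.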
