Collaborative Speech Dereverberation: Regularized Tensor Factorization for Crowdsourced Multi-Channel Recordings

We propose a regularized nonnegative tensor factorization (NTF) model for multi-channel speech dereverberation that incorporates prior knowledge about clean speech. The approach treats the problem as recovering a common source signal that has been convolved with a different room impulse response in each channel, which lets dereverberation benefit from microphone arrays. The factorization learns both the individual reverberation filters and channel-specific delays, so it can be applied to an ad-hoc microphone array with heterogeneous sensors (such as multi-channel recordings made by a crowd) even when the channels are not synchronized. We integrate two prior-knowledge regularization schemes to stabilize dereverberation performance. First, a nonnegative matrix factorization (NMF) inner routine supplies the NTF problem with pre-trained clean speech basis vectors, so that the optimization only has to estimate their activations rather than the whole clean speech spectra. Second, the NMF activation matrix is further regularized with sparsity and smoothness constraints so that it takes on the characteristics of dry signals. Empirical results on several simulated reverberation setups show that the prior-knowledge regularization schemes improve both recovered sound quality and speech intelligibility compared to a baseline NTF approach.
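As a rough illustration of the first regularization idea only, the sketch below estimates NMF activations for a fixed, pre-trained basis with an L1 sparsity penalty. It is not the authors' algorithm: the reverberation filters, channel delays, and smoothness constraint are all omitted, and the KL-divergence cost, the multiplicative update, the names `estimate_activations` and `kl_divergence`, the `sparsity` weight, and the random toy data are assumptions made for this example.

```python
import numpy as np

def estimate_activations(V, W, sparsity=0.1, n_iter=200, eps=1e-12, seed=0):
    """Estimate nonnegative activations H such that V ~= W @ H, with W held fixed.

    Assumed setup: KL-divergence cost plus an L1 penalty on H, minimized with
    the standard multiplicative update. Only H is updated; the pre-trained
    basis W is never changed.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    col_sums = W.sum(axis=0)[:, None]  # W^T 1, shape (K, 1)
    for _ in range(n_iter):
        ratio = V / (W @ H + eps)
        # The sparsity weight in the denominator shrinks activations toward zero.
        H *= (W.T @ ratio) / (col_sums + sparsity)
    return H

def kl_divergence(V, V_hat, eps=1e-12):
    """Generalized KL divergence between two nonnegative arrays."""
    return float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat))

# Toy data (hypothetical): a random "pre-trained" basis and sparse activations.
rng = np.random.default_rng(1)
W = rng.random((64, 8))                                     # 64 bins, 8 basis vectors
H_true = rng.random((8, 50)) * (rng.random((8, 50)) < 0.3)  # sparse ground truth
V = W @ H_true                                              # magnitude "spectrogram"

H = estimate_activations(V, W)
print("KL divergence after fitting:", kl_divergence(V, W @ H))
```

Holding `W` fixed reduces the number of free parameters from a full spectrogram to a small activation matrix, which is the motivation the abstract gives for the NMF inner routine.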
