An EM algorithm for joint source separation and diarisation of multichannel convolutive speech mixtures

We present a probabilistic model for joint source separation and diarisation of multichannel convolutive speech mixtures. We build upon the framework of local Gaussian model (LGM) with non-negative matrix factorization (NMF). The diarisation is introduced as a temporal labeling of each source in the mix as active or inactive at the short-term frame level. We devise an EM algorithm in which the source separation process is aided by the diarisation state, since the latter indicates the sources actually present in the mixture. The diarisation state is tracked with a Hidden Markov Model (HMM) with emission probabilities calculated from the estimated source signals. The proposed EM has separation performance comparable with a state-of-the-art LGM NMF method, while outperforming a state-of-the-art speaker diarisation pipeline.

[1]  Fabio Valente,et al.  Multistream speaker diarization of meetings recordings beyond MFCC and TDOA features , 2012, Speech Commun..

[2]  Pierre Comon,et al.  Handbook of Blind Source Separation: Independent Component Analysis and Applications , 2010 .

[3]  Rémi Gribonval,et al.  Performance measurement in blind audio source separation , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  Emmanuel Vincent,et al.  A General Flexible Framework for the Handling of Prior Information in Audio Source Separation , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Bhiksha Raj,et al.  Non-negative Hidden Markov Modeling of Audio with Application to Source Separation , 2010, LVA/ICA.

[6]  Radu Horaud,et al.  A Variational EM Algorithm for the Separation of Time-Varying Convolutive Audio Mixtures , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7]  Tim Brookes,et al.  A Comparison of Computational Precedence Models for Source Separation in Reverberant Environments , 2013 .

[8]  Carla Teixeira Lopes,et al.  TIMIT Acoustic-Phonetic Continuous Speech Corpus , 2012 .

[9]  James L. Massey,et al.  Proper complex random processes with applications to information theory , 1993, IEEE Trans. Inf. Theory.

[10]  Alexey Ozerov,et al.  Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  Rémi Gribonval,et al.  Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Lucas C. Parra,et al.  Convolutive blind separation of non-stationary sources , 2000, IEEE Trans. Speech Audio Process..

[13]  Jonathan G. Fiscus,et al.  DARPA TIMIT:: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 , 1993 .

[14]  Hirokazu Kameoka,et al.  Underdetermined blind separation and tracking of moving sources based ONDOA-HMM , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Hirokazu Kameoka,et al.  Unified approach for audio source separation with multichannel factorial HMM and DOA mixture model , 2015, 2015 23rd European Signal Processing Conference (EUSIPCO).

[16]  Rémi Gribonval,et al.  Non negative sparse representation for Wiener based source separation with a single sensor , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[17]  Nicolas Sturmel,et al.  Linear Mixing Models for Active Listening of Music Productions in Realistic Studio Conditions , 2012 .

[18]  Hirokazu Kameoka,et al.  A unified approach for underdetermined blind signal separation and source activity detection by multichannel factorial hidden Markov models , 2014, INTERSPEECH.

[19]  Paris Smaragdis,et al.  Supervised and Unsupervised Speech Enhancement Using Nonnegative Matrix Factorization , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Pierre Vandergheynst,et al.  Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation , 2010, 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010).

[21]  Nancy Bertin,et al.  Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis , 2009, Neural Computation.

[22]  Nicholas W. D. Evans,et al.  Speaker Diarization: A Review of Recent Research , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[23]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[24]  Mark D. Plumbley,et al.  Probabilistic Modeling Paradigms for Audio Source Separation , 2010 .

[25]  Douglas A. Reynolds,et al.  An overview of automatic speaker diarization systems , 2006, IEEE Transactions on Audio, Speech, and Language Processing.