Adaptive and online speaker diarization for meeting data

Speaker diarization aims to determine "who spoke when" in a given audio stream. Applications such as document structuring and information retrieval have driven the exploration of speaker diarization across many domains, from broadcast news to lectures, phone conversations and meetings. Almost all current diarization systems are offline and ill-suited to the growing need for online or real-time diarization, a need stemming from the increasing popularity of powerful mobile smart devices. While a small number of such systems have been reported, truly online diarization systems for challenging, highly spontaneous meeting data are lacking. This paper reports our work to develop an adaptive, online diarization system using the NIST Rich Transcription meeting corpora. While the system is not dissimilar to those previously reported for less challenging domains, high diarization error rates illustrate the challenge ahead and suggest several directions for future research to improve performance.
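To make the notion of "online" concrete, the sketch below illustrates the generic incremental-clustering idea that underlies most online diarization systems: each incoming speech segment is either assigned to the closest existing speaker model or opens a new one. This is only a minimal illustration, not the system described in the paper; it assumes each segment has already been reduced to a fixed-length feature vector (e.g., averaged MFCCs), and the class name, distance threshold and adaptation rate are illustrative choices, not values from the paper.

```python
# Minimal sketch of generic online (incremental) speaker clustering.
# NOT the system described in this paper: segment features are assumed
# to be precomputed, and the threshold/adaptation rate are illustrative.
import numpy as np


class OnlineDiarizer:
    def __init__(self, distance_threshold=1.5, adaptation_rate=0.1):
        self.threshold = distance_threshold  # max distance to join an existing speaker
        self.alpha = adaptation_rate         # how quickly speaker centroids adapt
        self.centroids = []                  # one running centroid per hypothesised speaker

    def process_segment(self, features):
        """Assign one incoming segment to a speaker label, creating a new
        speaker if no existing centroid is close enough."""
        features = np.asarray(features, dtype=float)
        if not self.centroids:
            self.centroids.append(features.copy())
            return 0
        distances = [np.linalg.norm(features - c) for c in self.centroids]
        best = int(np.argmin(distances))
        if distances[best] <= self.threshold:
            # Adapt the matched speaker model towards the new observation.
            self.centroids[best] = (1 - self.alpha) * self.centroids[best] + self.alpha * features
            return best
        # Otherwise hypothesise a new speaker.
        self.centroids.append(features.copy())
        return len(self.centroids) - 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "speakers" with well-separated segment features.
    true_speakers = [rng.normal(0.0, 0.3, 20), rng.normal(5.0, 0.3, 20)]
    diarizer = OnlineDiarizer(distance_threshold=1.5)
    for t in range(8):
        segment = true_speakers[t % 2] + rng.normal(0.0, 0.1, 20)
        print(f"segment {t}: speaker {diarizer.process_segment(segment)}")
```

Because each segment is labelled as soon as it arrives, latency is bounded by the segment length; the price, as the paper's results illustrate, is that errors made early cannot be revised by later evidence as they can in offline clustering.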
