DOVER: A Method for Combining Diarization Outputs

Speech recognition and other natural language tasks have long benefited from voting-based algorithms that aggregate the outputs of several systems to achieve higher accuracy than any individual system. Diarization, the task of segmenting an audio stream into speaker-homogeneous and co-indexed regions, has so far not benefited from this strategy because the structure of the task does not lend itself to a simple voting approach. This paper presents DOVER (diarization output voting error reduction), an algorithm for weighted voting among diarization hypotheses, in the spirit of the ROVER algorithm for combining speech recognition hypotheses. We evaluate the algorithm on diarization of meeting recordings with multiple microphones and find that it consistently reduces the diarization error rate relative to the average of the individual channels' results, and often improves on the single best channel chosen by an oracle.
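
The weighted-voting idea can be illustrated with a minimal sketch. This is not the DOVER algorithm itself: it assumes, beyond anything stated in the abstract, that the hypotheses have already been mapped to a shared set of speaker labels and sampled into frame sequences of equal length. The function name `weighted_frame_vote` and the example weights are hypothetical.

```python
from collections import defaultdict


def weighted_frame_vote(hypotheses, weights):
    """Pick, for each frame, the speaker label with the largest total weight.

    Assumption (not from the abstract): every hypothesis is a list of
    speaker labels, one per frame, all drawn from a shared label set.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    if len(hypotheses) != len(weights):
        raise ValueError("need exactly one weight per hypothesis")
    n_frames = len(hypotheses[0])
    if any(len(hyp) != n_frames for hyp in hypotheses):
        raise ValueError("all hypotheses must cover the same frames")

    combined = []
    for t in range(n_frames):
        scores = defaultdict(float)
        for hyp, weight in zip(hypotheses, weights):
            scores[hyp[t]] += weight
        # On a tie, max() keeps the label that was voted for first.
        combined.append(max(scores, key=scores.get))
    return combined


# Three hypothetical hypotheses over four frames.
hyps = [
    ["A", "A", "B", "B"],
    ["A", "B", "B", "B"],
    ["A", "A", "A", "B"],
]
voted = weighted_frame_vote(hyps, [0.5, 0.3, 0.2])
```

With these weights the vote reproduces the first hypothesis, because in every frame where the inputs disagree the first hypothesis is backed by one of the other two. The step this sketch skips, bringing independently produced speaker labels into a common label space, is what makes simple voting hard for diarization.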
