DOVER-Lap: A Method for Combining Overlap-Aware Diarization Outputs

Several advances have been made recently towards handling overlapping speech for speaker diarization. Since speech and natural language tasks often benefit from ensemble techniques, we propose an algorithm for combining outputs from such diarization systems through majority voting. Our method, DOVER-Lap, is inspired from the recently proposed DOVER algorithm, but is designed to handle overlapping segments in diarization outputs. We also modify the pair-wise incremental label mapping strategy used in DOVER, and propose an approximation algorithm based on weighted k-partite graph matching, which performs this mapping using a global cost tensor. We demonstrate the strength of our method by combining outputs from diverse systems -- clustering-based, region proposal networks, and target-speaker voice activity detection -- on AMI and LibriCSS datasets, where it consistently outperforms the single best system. Additionally, we show that DOVER-Lap can be used for late fusion in multichannel diarization, and compares favorably with early fusion methods like beamforming.

[1]  Lior Rokach,et al.  Ensemble-based classifiers , 2010, Artificial Intelligence Review.

[2]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Jonathan G. Fiscus,et al.  A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER) , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[4]  Mireia Díez,et al.  Speaker Diarization based on Bayesian HMM with Eigenvoice Priors , 2018, Odyssey.

[5]  Naoyuki Kanda,et al.  Microsoft Speaker Diarization System for the Voxceleb Speaker Recognition Challenge 2020 , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Shinji Watanabe,et al.  End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification , 2020, ArXiv.

[7]  Reinhold Haeb-Umbach,et al.  NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing , 2018, ITG Symposium on Speech Communication.

[8]  Leibny Paola García-Perera,et al.  Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Valentin Andrei,et al.  Detecting Overlapped Speech on Short Timeframes Using Deep Learning , 2017, INTERSPEECH.

[10]  Giorgio Gambosi,et al.  Complexity and approximation: combinatorial optimization problems and their approximability properties , 1999 .

[11]  Alan McCree,et al.  Speaker diarization using deep neural network embeddings , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Gunnar Evermann,et al.  Posterior probability decoding, confidence estimation and system combination , 2000 .

[13]  Nicholas W. D. Evans,et al.  Speaker Diarization: A Review of Recent Research , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Erik McDermott,et al.  Deep neural networks for small footprint text-dependent speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Douglas A. Reynolds,et al.  An overview of automatic speaker diarization systems , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[16]  Shrikanth Narayanan,et al.  Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap , 2020, IEEE Signal Processing Letters.

[17]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  David A. van Leeuwen,et al.  Speech overlap detection in a two-pass speaker diarization system , 2009, INTERSPEECH.

[19]  Heiga Zen,et al.  Speech Processing for Digital Home Assistants: Combining signal processing with deep-learning techniques , 2019, IEEE Signal Processing Magazine.

[20]  Viggo Kann,et al.  Maximum Bounded 3-Dimensional Matching is MAX SNP-Complete , 1991, Inf. Process. Lett..

[21]  Jean Carletta,et al.  The AMI Meeting Corpus: A Pre-announcement , 2005, MLMI.

[22]  Fabio Valente,et al.  Speaker diarization of overlapping speech based on silence distribution in meeting recordings , 2012, INTERSPEECH.

[23]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[24]  Richard M. Karp,et al.  Reducibility Among Combinatorial Problems , 1972, 50 Years of Integer Programming.

[25]  L. Burget,et al.  on Bayesian HMM with Eigenvoice Priors , 2018 .

[26]  Björn W. Schuller,et al.  Detecting overlapping speech with long short-term memory recurrent neural networks , 2013, INTERSPEECH.

[27]  Biing-Hwang Juang,et al.  Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[28]  Lukás Burget,et al.  The AMI System for the Transcription of Speech in Meetings , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[29]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[30]  Zhuo Chen,et al.  Continuous Speech Separation: Dataset and Analysis , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Marie Kunesová,et al.  Detection of Overlapping Speech for the Purposes of Speaker Diarization , 2019, SPECOM.

[32]  Jon Barker,et al.  The second ‘chime’ speech separation and recognition challenge: Datasets, tasks and baselines , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[33]  Ismail Hakki Toroslu,et al.  Incremental assignment problem , 2007, Inf. Sci..

[34]  Jon Barker,et al.  CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings , 2020, 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020).

[35]  Jun Du,et al.  Speaker Diarization with Enhancing Speech for the First DIHARD Challenge , 2018, INTERSPEECH.

[36]  Daniel Garcia-Romero,et al.  Diarization resegmentation in the factor analysis subspace , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[37]  Zhuo Chen,et al.  Meeting Transcription Using Asynchronous Distant Microphones , 2019, INTERSPEECH.

[38]  Björn W. Schuller,et al.  Enhancing LSTM RNN-Based Speech Overlap Detection by Artificially Mixed Data , 2017, Semantic Audio.

[39]  Andreas Stolcke,et al.  Dover: A Method for Combining Diarization Outputs , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[40]  Xavier Anguera Miró,et al.  Acoustic Beamforming for Speaker Diarization of Meetings , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[41]  Andreas Stolcke,et al.  THE SRI MARCH 2000 HUB-5 CONVERSATIONAL SPEECH TRANSCRIPTION SYSTEM , 2000 .

[42]  Shinji Watanabe,et al.  Speaker Diarization with Region Proposal Network , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[43]  Desh Raj,et al.  The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge , 2020, 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020).

[44]  Gerald Friedland,et al.  Overlapped speech detection for improved speaker diarization in multiparty meetings , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[45]  Aleksei Romanenko,et al.  Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario , 2020, INTERSPEECH.

[46]  H. Kuhn The Hungarian method for the assignment problem , 1955 .

[47]  Jon Barker,et al.  The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines , 2018, INTERSPEECH.

[48]  Desh Raj,et al.  Multi-class Spectral Clustering with Overlaps for Speaker Diarization , 2020, ArXiv.