Speaker Diarization: Current Limitations and New Directions

Author(s): Knox, Mary Tai | Advisor(s): Morgan, Nelson

Abstract: Speaker diarization is the problem of determining "who spoke when" in an audio recording when the number and identities of the speakers are unknown. Motivated by applications in automatic speech recognition and audio indexing, speaker diarization has been studied extensively over the past decade, and there are currently a wide variety of approaches - including both top-down and bottom-up unsupervised clustering methods. The contributions of this thesis are to provide a unified analysis of the current state-of-the-art, to understand where and why mistakes occur, and to identify directions for improvements.

In the first part of the thesis, we analyze the behavior of six state-of-the-art diarization systems, all evaluated on the National Institute of Standards and Technology (NIST) Rich Transcription 2009 evaluation dataset. While performance is typically assessed in terms of a single number - the diarization error rate (DER) - we further characterize the errors based on speech segment durations and their proximity to speaker change points. It is shown that for all of the systems, performance degrades both as the segment duration decreases and as the proximity to the speaker change point increases. Although short segments are problematic, their overall impact on the DER is small since the majority of scored time occurs in segments greater than 2.5 seconds. By contrast, the amount of time near speaker change points is relatively high, and thus poor performance near these change points contributes significantly to the DER. For example, for the single distant microphone (SDM) and multiple distant microphone (MDM) conditions, over 33% and 40% of the errors occur within 0.5 seconds of a change point for all evaluated systems, respectively.

In the next part of the thesis, we focus on the International Computer Science Institute (ICSI) speaker diarization system and explore the effects of various system modifications. This system contains many steps - including speech activity detection, initialization, speaker segmentation, and speaker clustering. Inspired by our previous analysis, we focus on modifications that improve performance near speaker change points. We first implement an alternative to the minimum duration constraint, which sets the shortest amount of speech time before a speaker change can occur. This modification results in a 12% relative improvement of the speaker error rate for the MDM condition, with the largest improvement occurring closest to the speaker change point, and a 3% relative improvement for the SDM condition. Next, we show how the difference between the largest and second largest log-likelihood scores provides valuable information for unsupervised clustering, namely it indicates which regions of the output are likely correct.

Lastly, we explore the potential of applying speaker diarization methodologies to other applications. Specifically, we investigate the use of a diarization-based algorithm for the problem of duplication detection, where the goal is to determine whether a given query (e.g., a short audio clip) has been taken from a reference set (e.g., a large collection of copyrighted media). With minimal modifications of the ICSI diarization system, we are able to obtain moderate performance. However, our approach is not competitive with existing approaches designed specifically for the problem of duplication detection, and the extent to which diarization-based approaches are useful for this application remains an open question.
