Speaker Diarization: A review

Speaker diarization is the task of identifying the start and end times of each speaker's turns in an audio file, together with the identity of the speaker, i.e., “who spoke when”. Diarization has many applications, including speaker indexing, retrieval, speech recognition with speaker identification, and the diarization of meetings and lectures. In this paper, we review state-of-the-art approaches applied to telephony, TV show, broadcast, and meeting data. Along with these state-of-the-art approaches, we review the major approaches commonly used in diarization. A few possible future directions for this technology are also identified.
