Diarization and Identity Attribution Compatibility in the Albayzin 2020 Challenge

The need to identify the speakers in a recording has evolved over time, demanding more and more information. Speaker recognition originally focused on determining whether a given speaker talks in a single-speaker recording, and diarization later focused on differentiating the speakers throughout a recording. The latest step is Identity Assignment (IA), which combines both tasks: deciding whether a given speaker is present in an audio recording and determining the periods of time when that speaker is active. Our work presents and analyzes the ViVoLAB results for the Albayzín 2020 evaluation, which focused on diarization and identity assignment. These challenges are addressed in the broadcast domain, with data from the Spanish national TV corporation RTVE. For this purpose we have developed a bottom-up diarization architecture based on the embedding-PLDA paradigm. On top of the diarization solution we have added an identity assignment block based on the speaker verification approach.
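As an illustration only, the idea of stacking an identity assignment block on top of a diarization output can be sketched as follows. This is a minimal sketch under stated assumptions, not the ViVoLAB system: the function names, the one-embedding-per-cluster representation, the threshold value, and the use of cosine similarity as a stand-in for PLDA scoring are all hypothetical choices made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (stand-in for a PLDA score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_identities(cluster_embeddings, enrolled, threshold=0.7):
    """Map each diarization cluster to an enrolled identity, or None.

    cluster_embeddings: {cluster_id: embedding} from a diarization stage.
    enrolled: {speaker_name: embedding} for the known target speakers.
    threshold: hypothetical verification threshold (an assumption).
    """
    assignments = {}
    for cluster_id, emb in cluster_embeddings.items():
        best_name, best_score = None, threshold
        for name, ref in enrolled.items():
            score = cosine(emb, ref)
            # Accept the best-scoring enrolled speaker that passes the threshold.
            if score >= best_score:
                best_name, best_score = name, score
        assignments[cluster_id] = best_name
    return assignments


def label_segments(segments, assignments):
    """Replace cluster labels in (start, end, cluster_id) segments with identities."""
    return [(start, end, assignments.get(cid)) for start, end, cid in segments]


# Toy usage: two diarization clusters, one enrolled speaker.
clusters = {"spk0": [1.0, 0.0], "spk1": [0.0, 1.0]}
enrolled = {"alice": [0.9, 0.1]}
assignments = assign_identities(clusters, enrolled)
labeled = label_segments([(0.0, 2.5, "spk0"), (2.5, 4.0, "spk1")], assignments)
```

In this sketch, clusters that match no enrolled speaker keep the label `None`, which reflects the verification-style decision: a speaker is either accepted as a known identity or left unassigned.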
