GTH-UPM System for Albayzin Multimodal Diarization Challenge 2020

This paper describes the multimodal diarization system submitted by the GTH-UPM team to the Albayzin Multimodal Diarization Challenge 2020. The proposed solution consists of two separate diarization systems, one operating on the visual stream and one on the aural stream. The visual diarization system exploits both web resources and the provided enrollment images. First, these images are passed through a face detector. Next, every detected face is fed to FaceNet to generate an embedding. A clustering algorithm is then applied to the extracted embeddings, yielding a representative cluster for each participant. The centroid of each representative cluster acts as that participant's model. When a new embedding extracted from a face in the program arrives at the system, it is assigned the identity of the closest centroid among all participants, provided the similarity to that centroid exceeds a fixed quality threshold. The aural speaker diarization problem is tackled as a classification task in which a deep learning model learns the mapping between automatically extracted sequences of x-vectors and speaker identities. Working on sequences helps overcome the scarcity of training samples per speaker. The best submitted results reached a diarization error rate (DER) of 66.94% for visual diarization and 125.24% for aural diarization on the test set.
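As a rough illustration of the enrollment and assignment steps above, the Python sketch below clusters one participant's face embeddings to obtain a representative centroid and then labels new embeddings by nearest centroid under a distance threshold. The choice of DBSCAN, the Euclidean metric, and the threshold values are assumptions made for illustration; the abstract does not specify the exact clustering algorithm or distance settings.

```python
# Minimal sketch of the visual enrollment/assignment logic, assuming face
# embeddings (e.g. 512-d FaceNet vectors) have already been extracted.
# DBSCAN, Euclidean distance, and the threshold values are illustrative
# assumptions, not the paper's exact settings.
import numpy as np
from sklearn.cluster import DBSCAN


def participant_centroid(embeddings, eps=0.9, min_samples=3):
    """Cluster one participant's enrollment embeddings and return the
    centroid of the largest (representative) cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    valid = labels[labels != -1]                   # drop DBSCAN noise points
    if valid.size == 0:
        return embeddings.mean(axis=0)             # fallback: no dense cluster
    top = np.bincount(valid).argmax()              # largest cluster wins
    return embeddings[labels == top].mean(axis=0)


def assign_identity(embedding, centroids, max_dist=1.1):
    """Label a new face embedding with the closest participant centroid,
    or None when no centroid is close enough (the quality threshold)."""
    best_id, best_dist = None, np.inf
    for pid, centroid in centroids.items():
        dist = np.linalg.norm(embedding - centroid)
        if dist < best_dist:
            best_id, best_dist = pid, dist
    return best_id if best_dist <= max_dist else None
```

In this sketch, centroids would be a dictionary mapping each participant's name to the output of participant_centroid over that participant's enrollment embeddings; faces whose best match fails the threshold are left unlabeled rather than forced onto a participant.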
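The aural branch can be pictured with a similar sketch: a small sequence model that maps a window of x-vectors to speaker posteriors. The LSTM architecture, layer sizes, and sequence-level labeling below are illustrative assumptions; the abstract only states that a deep learning model classifies sequences of x-vectors into speaker identities.

```python
# Minimal sketch of the aural branch: a sequence model mapping windows of
# x-vectors to speaker identities. The LSTM, its dimensions, and the
# sequence-level output are assumptions for illustration only.
import torch
import torch.nn as nn


class XVectorSequenceClassifier(nn.Module):
    def __init__(self, xvec_dim=512, hidden=128, n_speakers=10):
        super().__init__()
        self.rnn = nn.LSTM(xvec_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_speakers)

    def forward(self, x):                # x: (batch, seq_len, xvec_dim)
        _, (h, _) = self.rnn(x)          # final hidden state summarizes the sequence
        return self.out(h[-1])           # logits over speaker identities


# Usage: classify a batch of 4 sequences of ten 512-d x-vectors each.
model = XVectorSequenceClassifier()
logits = model(torch.randn(4, 10, 512))
speakers = logits.argmax(dim=1)          # one predicted speaker per sequence
```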
