GTH-UPM System for Albayzin Multimodal Diarization Challenge 2020

This paper describes the multimodal diarization system submitted by the GTH-UPM team to the Albayzin Multimodal Diarization Challenge 2020. The proposed solution consists of two separate diarization systems, one operating on the visual stream and one on the aural stream. The visual diarization system exploits both web resources and the provided enrollment images. First, these images are passed through a face detector. Next, every detected face is fed to FaceNet to generate an embedding. A clustering algorithm is then applied to the extracted embeddings, yielding a representative cluster for each participant. The centroid of each representative cluster acts as that participant's model. When a new embedding extracted from a face in the program arrives at the system, it is assigned the identity of the closest centroid among all participants, provided the similarity to that centroid exceeds a fixed quality threshold. The aural speaker diarization problem is tackled as a classification task in which a deep learning model learns the mapping between automatically extracted sequences of x-vectors and speaker identities. Working on sequences helps overcome the scarcity of training samples per speaker. The best submitted results reached a diarization error rate (DER) of 66.94% for visual diarization and 125.24% for aural diarization on the test set.
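As a rough illustration of the enrollment and assignment steps above, the Python sketch below clusters one participant's face embeddings to obtain a representative centroid and then labels new embeddings by nearest centroid under a distance threshold. The choice of DBSCAN, the Euclidean metric, and the threshold values are assumptions made for illustration; the abstract does not specify the exact clustering algorithm or distance settings.

```python
# Minimal sketch of the visual enrollment/assignment logic, assuming face
# embeddings (e.g. 512-d FaceNet vectors) have already been extracted.
# DBSCAN, Euclidean distance, and the threshold values are illustrative
# assumptions, not the paper's exact settings.
import numpy as np
from sklearn.cluster import DBSCAN


def participant_centroid(embeddings, eps=0.9, min_samples=3):
    """Cluster one participant's enrollment embeddings and return the
    centroid of the largest (representative) cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    valid = labels[labels != -1]                   # drop DBSCAN noise points
    if valid.size == 0:
        return embeddings.mean(axis=0)             # fallback: no dense cluster
    top = np.bincount(valid).argmax()              # largest cluster wins
    return embeddings[labels == top].mean(axis=0)


def assign_identity(embedding, centroids, max_dist=1.1):
    """Label a new face embedding with the closest participant centroid,
    or None when no centroid is close enough (the quality threshold)."""
    best_id, best_dist = None, np.inf
    for pid, centroid in centroids.items():
        dist = np.linalg.norm(embedding - centroid)
        if dist < best_dist:
            best_id, best_dist = pid, dist
    return best_id if best_dist <= max_dist else None
```

In this sketch, centroids would be a dictionary mapping each participant's name to the output of participant_centroid over that participant's enrollment embeddings; faces whose best match fails the threshold are left unlabeled rather than forced onto a participant.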
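The aural branch can be pictured with a similar sketch: a small sequence model that maps a window of x-vectors to speaker posteriors. The LSTM architecture, layer sizes, and sequence-level labeling below are illustrative assumptions; the abstract only states that a deep learning model classifies sequences of x-vectors into speaker identities.

```python
# Minimal sketch of the aural branch: a sequence model mapping windows of
# x-vectors to speaker identities. The LSTM, its dimensions, and the
# sequence-level output are assumptions for illustration only.
import torch
import torch.nn as nn


class XVectorSequenceClassifier(nn.Module):
    def __init__(self, xvec_dim=512, hidden=128, n_speakers=10):
        super().__init__()
        self.rnn = nn.LSTM(xvec_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_speakers)

    def forward(self, x):                # x: (batch, seq_len, xvec_dim)
        _, (h, _) = self.rnn(x)          # final hidden state summarizes the sequence
        return self.out(h[-1])           # logits over speaker identities


# Usage: classify a batch of 4 sequences of ten 512-d x-vectors each.
model = XVectorSequenceClassifier()
logits = model(torch.randn(4, 10, 512))
speakers = logits.argmax(dim=1)          # one predicted speaker per sequence
```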
