Who said that?: Audio-visual speaker diarisation of real-world meetings

The goal of this work is to determine 'who spoke when' in real-world meetings. The method takes surround-view video and single- or multi-channel audio as inputs, and generates robust diarisation outputs. To achieve this, we propose a novel iterative approach that first enrols speaker models using audio-visual correspondence, then uses the enrolled models together with the visual information to determine the active speaker. We show strong quantitative and qualitative performance on a dataset of real-world meetings. The method is also evaluated on the public AMI meeting corpus, on which we demonstrate results that exceed all comparable methods. We also show that beamforming can be used together with the video to further improve performance when multi-channel audio is available.
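The enrol-then-assign idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes precomputed per-segment speaker embeddings, per-segment audio-visual synchronisation scores, and face-track identities from the video, all of which are hypothetical inputs here. Enrolment averages the embeddings of segments whose audio-visual correspondence is high; assignment then labels every segment by cosine similarity to the enrolled models.

```python
import numpy as np

def enrol_speakers(seg_embeds, av_sync_scores, face_track_ids, threshold=0.8):
    """Enrol one model per visible speaker.

    Only segments whose audio-visual sync score clears the threshold
    contribute, on the assumption that high correspondence means the
    on-screen face is the person speaking. `face_track_ids` (identity
    per segment from video face tracking) is assumed given.
    """
    models = {}
    for spk in set(face_track_ids):
        idx = [i for i, s in enumerate(face_track_ids)
               if s == spk and av_sync_scores[i] >= threshold]
        if idx:
            m = np.mean([seg_embeds[i] for i in idx], axis=0)
            models[spk] = m / np.linalg.norm(m)  # unit-normalise the model
    return models

def assign_segments(seg_embeds, models):
    """Label each segment with the enrolled speaker of highest cosine similarity."""
    labels = []
    for e in seg_embeds:
        e = e / np.linalg.norm(e)
        labels.append(max(models, key=lambda s: float(e @ models[s])))
    return labels
```

In an iterative scheme such as the one described, the new labels could refine which segments are trusted for enrolment on the next pass; that loop is omitted here for brevity.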
