Audio-Video detection of the active speaker in meetings

Meetings are a common activity that poses particular challenges for systems designed to assist them. One such challenge is active speaker detection, which can provide useful information for human-interaction modeling or human-robot interaction. Active speaker detection is mostly performed using speech; however, visual and contextual information can provide additional insights. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information drawn from the meeting context. The visual cue is processed with a Convolutional Neural Network (CNN) that captures spatio-temporal relationships. We analyze several CNN architectures with two visual inputs: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is performed with an original methodology based on the gaze of all participants. We evaluate our proposal on a public state-of-the-art benchmark, the AMI corpus, and show how the addition of visual and contextual information improves active speaker detection performance.
