Multimodal speaker clustering in full length movies

Multimodal speaker clustering/diarization tries to answer the question "who spoke when" by using both audio and visual information. Diarization consists of two steps: first, segmentation of the audio stream and detection of the speech segments; then, clustering of the speech segments to group them by speaker. This task has mainly been studied on audiovisual data from meetings, news broadcasts, or talk shows. In this paper, we use visual information to aid speaker clustering and introduce a new video-based feature, called actor presence, that can be used to enhance audio-based speaker clustering. We tested the proposed method on three full-length stereoscopic movies, a scenario considerably more difficult than those studied so far, since there is no guarantee that speech segments and on-screen appearances of the actors will always overlap. The results show that visual information can improve speaker clustering accuracy and hence the diarization process.
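The fusion idea described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual method: it assumes each speech segment is described by an audio feature vector and a binary actor-presence vector (which actors are visible on screen during the segment), fuses the two cosine-similarity matrices with an assumed weight `alpha`, and merges segments with a simple greedy single-link rule. The real system would use proper audio features, the actor-presence feature defined in the paper, and a stronger clustering algorithm such as spectral clustering.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all-zero)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def fused_similarity(audio, presence, alpha=0.7):
    """Weighted fusion of audio similarity and actor-presence similarity.

    alpha is a hypothetical fusion weight; the paper's combination rule differs.
    """
    n = len(audio)
    return [[alpha * cosine(audio[i], audio[j])
             + (1 - alpha) * cosine(presence[i], presence[j])
             for j in range(n)] for i in range(n)]

def cluster(S, threshold=0.8):
    """Greedy single-link merging: segments with fused similarity above
    the threshold receive the same speaker label."""
    n = len(S)
    labels = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if S[i][j] >= threshold:
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    return labels

# Toy example: segments 0 and 1 share audio characteristics and the same
# on-screen actor, segment 2 does not.
audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
presence = [[1, 0], [1, 0], [0, 1]]
labels = cluster(fused_similarity(audio, presence))
```

Here the visual cue reinforces the audio similarity between the first two segments, so they end up with the same label, while the third segment stays in its own cluster.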
