Speaker Clustering Aided by Visual Dialogue Analysis

Speaker clustering automatically groups speech segments by speaker. With speaker clustering, we can discover the main cast of a long video and retrieve each cast member's video clips for efficient browsing. In this paper, we propose a dialogue-supervised speaker clustering method that uses the results of visual dialogue analysis to improve clustering performance. Compared with the traditional approach based only on acoustic features, the dialogue-supervised approach significantly improves clustering results on movies and TV series.
