Shot Classification and Keyframe Detection for Vision Based Speakers Diarization in Parliamentary Debates

Automatic labelling of speakers is an essential task for speaker diarization in parliamentary debates, given the huge amount of video data to annotate. In this paper, we address speaker diarization as a visual speaker re-identification problem, with special emphasis on the analysis of different shot types. We propose two approaches that make use of convolutional neural networks (CNNs) and biometric traits for keyframe extraction. The approaches are evaluated on challenging real-world datasets from the Canary Islands Parliament and compared with a similar approach that does not analyze the shot type. Results show that using CNNs for shot classification together with biometric traits improves re-identification performance by an average of 9.8%.
