Speaker Turn Detection Based on Multimodal Situation Analysis

A key stage of speaker diarization is the detection of the time points at which speakers change. Most approaches to the speaker turn detection problem focus on processing a single-channel audio signal and are applied to archived recordings. Recently, speaker diarization has come to be considered from a multimodal point of view. In this paper we survey modern methods of audio and video signal processing and personification data analysis for multimodal speaker diarization. The proposed PARAD-R software for Russian speech analysis has been implemented for audio-based speaker diarization and will be enhanced based on advances in multimodal situation analysis in a meeting room.
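As an illustration of the audio-only baseline that the abstract describes, a common way to detect a speaker turn is to test, for each candidate boundary, whether the surrounding acoustic features are better modeled by two Gaussians than by one, using the Bayesian Information Criterion (BIC). The sketch below is not the PARAD-R implementation; it is a minimal, hypothetical example of BIC-based change detection over generic feature vectors (e.g. MFCCs), with the function name and `penalty` parameter chosen for illustration.

```python
import numpy as np

def bic_change_score(features, split, penalty=1.0):
    """Delta-BIC score for a candidate speaker change at frame `split`.

    features: (n_frames, n_dims) array of acoustic features (e.g. MFCCs).
    A positive score suggests the two sides of `split` are better modeled
    by separate full-covariance Gaussians, i.e. a likely speaker turn.
    """
    x, y, z = features[:split], features[split:], features
    n1, n2, n = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet(a):
        # log-determinant of the sample covariance of a Gaussian fit
        return np.linalg.slogdet(np.cov(a, rowvar=False))[1]

    # likelihood-ratio term: one Gaussian vs. two
    score = 0.5 * (n * logdet(z) - n1 * logdet(x) - n2 * logdet(y))
    # BIC penalty for the extra mean and covariance parameters
    score -= 0.5 * penalty * (d + 0.5 * d * (d + 1)) * np.log(n)
    return score
```

In a full diarization front end this score would be evaluated over a sliding window, and local maxima above zero would be emitted as speaker turn labels; multimodal systems then refine these boundaries with video cues such as lip activity.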
