Speakers Role Recognition in Multiparty Audio Recordings Using Social Network Analysis and Duration Distribution Modeling

This paper presents two approaches for speaker role recognition in multiparty audio recordings. The experiments are performed over a corpus of 96 radio bulletins corresponding to roughly 19 h of material. Each recording involves, on average, 11 speakers playing one among six roles belonging to a predefined set. Both proposed approaches start by segmenting automatically the recordings into single speaker segments, but perform role recognition using different techniques. The first approach is based on Social Network Analysis, the second relies on the intervention duration distribution across different speakers. The two approaches are used separately and combined and the results show that around 85% of the recording time can be labeled correctly in terms of role.

[1]  Lin-shan Lee,et al.  Spoken document understanding and organization , 2005, IEEE Signal Processing Magazine.

[2]  Jack Y. B. Lee Channel folding - an algorithm to improve efficiency of multicast video-on-demand systems , 2005, IEEE Transactions on Multimedia.

[3]  Mubarak Shah,et al.  A framework for segmentation of talk and game shows , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[4]  Marcel Worring,et al.  Multimodal Video Indexing : A Review of the State-ofthe-art , 2001 .

[5]  Jitendra Ajmera,et al.  Robust audio segmentation , 2004 .

[6]  Patrick Bouthemy,et al.  Unsupervised soccer video abstraction based on pitch, dominant color and camera motion analysis , 2004, MULTIMEDIA '04.

[7]  Samy Bengio,et al.  Modeling individual and group actions in meetings with layered HMMs , 2006, IEEE Transactions on Multimedia.

[8]  S. Wasserman,et al.  Social Network Analysis: Computer Programs , 1994 .

[9]  Samy Bengio,et al.  Automatic analysis of multimodal group actions in meetings , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Julia Hirschberg,et al.  The Rules Behind Roles: Identifying Speaker Role in Radio Broadcasts , 2000, AAAI/IAAI.

[11]  I. Miller Probability, Random Variables, and Stochastic Processes , 1966 .

[12]  S. Garrod,et al.  Group Discussion as Interactive Dialogue or as Serial Monologue: The Influence of Group Size , 2000, Psychological science.

[13]  John Scott Social Network Analysis , 1988 .

[14]  Samy Bengio,et al.  Extracting information from multimedia meeting collections , 2005, MIR '05.

[15]  Alan Hanjalic,et al.  Affective video content representation and modeling , 2005, IEEE Transactions on Multimedia.

[16]  S. Renals,et al.  Content-based access to spoken audio , 2005, IEEE Signal Processing Magazine.

[17]  Guy J. Brown,et al.  Speech and crosstalk detection in multichannel audio , 2005, IEEE Transactions on Speech and Audio Processing.

[18]  C.-C. Jay Kuo,et al.  Audio content analysis for online audiovisual data segmentation and classification , 2001, IEEE Trans. Speech Audio Process..

[19]  Gareth J. F. Jones,et al.  Affect-based indexing and retrieval of films , 2005, MULTIMEDIA '05.

[20]  John G. Proakis,et al.  Probability, random variables and stochastic processes , 1985, IEEE Trans. Acoust. Speech Signal Process..

[21]  Marti A. Hearst,et al.  A Critique and Improvement of an Evaluation Metric for Text Segmentation , 2002, CL.

[22]  Frederick Jelinek,et al.  Statistical methods for speech recognition , 1997 .

[23]  Alan Hanjalic,et al.  Adaptive extraction of highlights from a sport video based on excitement modeling , 2005, IEEE Transactions on Multimedia.

[24]  Alex Acero,et al.  Spoken Language Processing: A Guide to Theory, Algorithm and System Development , 2001 .

[25]  Lie Lu,et al.  A generic framework of user attention model and its application in video summarization , 2005, IEEE Trans. Multim..

[26]  Jitendra Ajmera,et al.  A robust speaker clustering algorithm , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[27]  Shrikanth S. Narayanan,et al.  Toward detecting emotions in spoken dialogs , 2005, IEEE Transactions on Speech and Audio Processing.

[28]  Marcel Worring,et al.  Multimedia event-based video indexing using time intervals , 2005, IEEE Transactions on Multimedia.

[29]  Noboru Babaguchi,et al.  Event based indexing of broadcasted sports video by intermodal collaboration , 2002, IEEE Trans. Multim..

[30]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[31]  Bruno O. Shubert,et al.  Random variables and stochastic processes , 1979 .

[32]  John Scott What is social network analysis , 2010 .