Multimodal speaker identification in legislative discourse

A first-of-its-kind platform, Digital Democracy offers a searchable archive of all statements made in state legislative hearings in four US states (California, New York, Texas, and Florida), which together cover one third of the US population. The platform's purpose is to increase government transparency in state legislatures: it allows citizens to follow state lawmakers, lobbyists, and advocates as they debate, craft, and vote on policy proposals. State hearings in the US are typically recorded on video and broadcast on cable TV stations, but they are neither transcribed nor indexed, and no official written records of them exist. Digital Democracy creates professional-quality transcripts and provides them publicly in a searchable, video-synchronized format. In this paper, we focus on one of the main challenges in video transcription: identifying the speaker. While transcription makes the content of state hearings searchable, people can follow the dynamics of a debate only if they know who said what, and when, during a hearing. Our speaker identification approach applies well-known voice recognition, deep-learning-based face recognition, and text understanding techniques to an ever-growing pool of legislative hearing videos. The speaker predictions from voice, face, and text are combined by a ranking scheme and ultimately integrated back into the transcription process. We describe the current architecture, technologies, and multimodal fusion used for speaker identification within the Digital Democracy project, and show that our multimodal fusion approach increases "top 5" legislator identification accuracy to 80% in California and 58% in New York.
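The abstract names a ranking scheme for fusing the voice, face, and text predictions but does not specify its scoring rule. As a minimal sketch, assuming a Borda-style count over per-modality candidate lists (the fusion rule and speaker IDs here are hypothetical, not necessarily the paper's exact method), a rank-level fusion producing a "top 5" list could look like this:

```python
from collections import defaultdict

def fuse_rankings(modality_rankings, top_k=5):
    """Fuse per-modality candidate rankings with a Borda-style count.

    modality_rankings: ranked lists of speaker IDs, best first
    (e.g. one list each from the voice, face, and text models).
    Returns the top_k speaker IDs by fused score.
    """
    scores = defaultdict(float)
    for ranking in modality_rankings:
        n = len(ranking)
        for rank, speaker in enumerate(ranking):
            # Higher-ranked candidates earn more points; a candidate
            # absent from a modality's list earns nothing from it.
            scores[speaker] += n - rank
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical per-modality outputs for one utterance.
voice = ["leg_42", "leg_07", "leg_19"]
face = ["leg_07", "leg_42", "leg_33"]
text = ["leg_42", "leg_33", "leg_07"]

print(fuse_rankings([voice, face, text]))
# ['leg_42', 'leg_07', 'leg_33', 'leg_19']
```

A rank-level rule like this has the practical advantage that the three recognizers need not produce calibrated, directly comparable confidence scores; only their orderings matter.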
