Improving speaker identification in TV-shows using person name detection in overlaid text and speech

This paper investigates the use of auxiliary information to help a classical acoustic-based speaker identification system in the specific context of TV shows. The underlying assumption is that auxiliary information could help (1) to rerank the n-best speaker hypotheses provided by the acoustic-only speaker identification system, (2) to provide a confidence score that refines the rejection process (open-set identification task), and (3) to identify speakers not covered by the speaker dictionary used by the speaker identification system (out-of-dictionary speakers; full-set verification task), the last point being one of the main issues when dealing with TV shows. In this paper, the auxiliary information is based on person names detected in overlaid text and speech. Experiments conducted on three different datasets from the REPERE evaluation campaign highlight the value of this auxiliary information, notably the use of overlaid person names to identify out-of-dictionary speakers, confirming the key assumptions made.
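
The three roles listed above can be illustrated with a minimal sketch. It is not the paper's method: the additive bonus, the rejection threshold, the fallback rule, and the names `identify_speaker`, `bonus`, and `reject_threshold` are all assumptions made for illustration, since the abstract does not specify how the scores are combined.

```python
from typing import List, Optional, Set, Tuple


def identify_speaker(
    nbest: List[Tuple[str, float]],
    detected_names: Set[str],
    dictionary: Set[str],
    bonus: float = 0.2,             # hypothetical weight for a name match
    reject_threshold: float = 0.5,  # hypothetical open-set rejection threshold
) -> Optional[str]:
    """Toy combination of acoustic n-best scores with detected person names."""
    # (1) Rerank: boost hypotheses whose name was detected in overlaid
    # text or speech (assumed additive bonus).
    rescored = [
        (name, score + (bonus if name in detected_names else 0.0))
        for name, score in nbest
    ]
    rescored.sort(key=lambda item: item[1], reverse=True)

    # (2) Rejection: accept the top hypothesis only if its combined
    # score reaches the threshold.
    if rescored and rescored[0][1] >= reject_threshold:
        return rescored[0][0]

    # (3) Out-of-dictionary: if exactly one detected name is unknown to
    # the speaker dictionary, propose it as the speaker.
    out_of_dictionary = sorted(detected_names - dictionary)
    if len(out_of_dictionary) == 1:
        return out_of_dictionary[0]

    return None  # speaker rejected as unknown
```

In this sketch, a detected name can promote a lower-ranked acoustic hypothesis, and a detected name absent from the dictionary is returned only when the acoustic evidence is rejected and the name is unambiguous.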
