Lexical speaker identification in TV shows

It is possible to use lexical information extracted from speech transcripts for speaker identification (SID), either on its own or to improve the performance of standard cepstral-based SID systems upon fusion. This was established before typically using isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). On the contrary, this work applies lexical approaches for SID on a different type of data. It uses the REPERE corpus consisting of unsegmented multiparty conversations, mostly debates, discussions and Q&A sessions from TV shows. It is hypothesized that people give out clues to their identity when speaking in such settings which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied in this work, including TFIDF, BM25 and LDA-based topic modeling. Results are analysed in terms of TV shows and speaker roles. Lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector - Support Vector Machine (GSV-SVM) system. Also, in certain cases, the lexical system shows modest improvement over the cepstral-based system performance using score-level sum fusion. To highlight the potential of using lexical information not just to improve upon cepstral-based SID systems but as an independent approach in its own right, initial studies on crossmedia SID is briefly reported. Instead of using speech data as all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models which are then tested on speech transcripts to identify speakers.

[1]  Olivier Galibert,et al.  The REPERE Corpus : a multimodal corpus for person recognition , 2012, LREC.

[2]  Jean-Luc Gauvain,et al.  Multistage speaker diarization of broadcast news , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[3]  Patrick Kenny,et al.  Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification , 2009, INTERSPEECH.

[4]  Gökhan Tür,et al.  Duration and pronunciation conditioned lexical modeling for speaker verification , 2007, INTERSPEECH.

[5]  Elizabeth Shriberg,et al.  Higher-Level Features in Speaker Recognition , 2007, Speaker Classification.

[6]  Douglas D. O'Shaughnessy,et al.  Multitaper MFCC and PLP features for speaker verification using i-vectors , 2013, Speech Commun..

[7]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[8]  Andreas Stolcke,et al.  SRI's 2004 NIST speaker recognition evaluation system , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[9]  Claude Barras,et al.  On the use of GSV-SVM for Speaker Diarization and Tracking , 2010, Odyssey.

[10]  Douglas E. Sturim,et al.  Support vector machines using GMM supervectors for speaker verification , 2006, IEEE Signal Processing Letters.

[11]  Haizhou Li,et al.  An overview of text-independent speaker recognition: From features to supervectors , 2010, Speech Commun..

[12]  A. Nayeemulla Khan,et al.  Latent semantic analysis for speaker recognition , 2004, INTERSPEECH.

[13]  George R. Doddington,et al.  Speaker recognition based on idiolectal differences between speakers , 2001, INTERSPEECH.

[14]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[15]  Sue Tranter Who Really Spoke When? Finding Speaker Turns and Identities in Broadcast News Audio , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[16]  Jean-Luc Gauvain,et al.  Partitioning and transcription of broadcast news data , 1998, ICSLP.

[17]  L. Lamel,et al.  A comparative study using manual and automatic transcriptions for diarization , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[18]  Sridha Sridharan,et al.  Improved phonetic and lexical speaker recognition through MAP adaptation , 2004, Odyssey.

[19]  Olivier Galibert,et al.  The First Official REPERE Evaluation , 2013, SLAM@INTERSPEECH.

[20]  Ngoc Thang Vu,et al.  Speech recognition for machine translation in Quaero , 2011, IWSLT.

[21]  Hynek Hermansky,et al.  Developing a speaker identification system for the DARPA RATS project , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22]  Julie Mauclair,et al.  Speaker Diarization: About whom the Speaker is Talking ? , 2006, 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop.

[23]  Doris Baum Recognising speakers from the topics they talk about , 2012, Speech Commun..

[24]  Lori Lamel,et al.  Comparing Multi-Stage Approaches for Cross-Show Speaker Diarization , 2011, INTERSPEECH.

[25]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[26]  William M. Campbell,et al.  Phonetic Speaker Recognition with Support Vector Machines , 2003, NIPS.

[27]  Jean-Luc Gauvain,et al.  The LIMSI SDR System for TREC-8 , 1999, TREC.

[28]  Stephen E. Robertson,et al.  A probabilistic model of information retrieval: development and comparative experiments - Part 2 , 2000, Inf. Process. Manag..