Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks

Speaker change detection is an important step in a speaker diarization system. It aims to find speaker change points in the audio stream. In this paper, it is treated as a sequence labeling task and addressed with bidirectional long short-term memory networks (Bi-LSTM). The system is trained and evaluated on the broadcast TV subset of the ETAPE database. Results show that the proposed model improves over conventional methods based on the Bayesian information criterion (BIC) and Gaussian divergence. For instance, compared with Gaussian divergence, it produces speech turns that are 19.5% longer on average at the same level of purity.
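
To make the sequence labeling formulation concrete, the sketch below turns reference change-point times into one binary label per frame, which is the kind of target a frame-level sequence labeler such as a Bi-LSTM could be trained on. The abstract does not specify the labeling scheme, so the function name, the frame step, and the tolerance window around each change point are illustrative assumptions rather than the paper's settings.

```python
from typing import List


def change_points_to_labels(
    change_points: List[float],
    duration: float,
    frame_step: float = 0.01,   # assumed frame step in seconds (hypothetical)
    tolerance: float = 0.05,    # assumed half-width of the positive window (hypothetical)
) -> List[int]:
    """Return one label per frame: 1 near a speaker change, 0 elsewhere."""
    n_frames = int(round(duration / frame_step))
    labels = [0] * n_frames
    for t in change_points:
        # Mark every frame within `tolerance` seconds of the change point.
        first = max(0, int(round((t - tolerance) / frame_step)))
        last = min(n_frames - 1, int(round((t + tolerance) / frame_step)))
        for i in range(first, last + 1):
            labels[i] = 1
    return labels


# Toy example: a 1-second stream with a single change at 0.5 s.
labels = change_points_to_labels([0.5], duration=1.0, frame_step=0.1, tolerance=0.1)
```

Widening the positive window is one common way to soften the heavy class imbalance between change and non-change frames; whether the paper does this cannot be confirmed from the abstract alone.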
