Combining Speaker Turn Embedding and Incremental Structure Prediction for Low-Latency Speaker Diarization

Real-time speaker diarization has many potential applications, including public security, biometrics or forensics. It can also significantly speed up the indexing of increasingly large multimedia archives. In this paper, we address the issue of lowlatency speaker diarization that consists in continuously detecting new or reoccurring speakers within an audio stream, and determining when each speaker is active with a low latency (e.g. every second). This is in contrast with most existing approaches in speaker diarization that rely on multiple passes over the complete audio recording. The proposed approach combines speaker turn neural embeddings with an incremental structure prediction approach inspired by state-of-the-art Natural Language Processing models for Part-of-Speech tagging and dependency parsing. It can therefore leverage both information describing the utterance and the inherent temporal structure of interactions between speakers to learn, in supervised framework, to identify speakers. Experiments on the Etape broadcast news benchmark validate the approach.

[1]  Giorgio Satta,et al.  Guided Learning for Bidirectional Sequence Classification , 2007, ACL.

[2]  Yu Qiao,et al.  A Discriminative Feature Learning Approach for Deep Face Recognition , 2016, ECCV.

[3]  Hervé Bredin,et al.  pyannote.metrics: A Toolkit for Reproducible Evaluation, Diagnostic, and Error Analysis of Speaker Diarization Systems , 2017, INTERSPEECH.

[4]  Daben Liu,et al.  Online speaker clustering , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[5]  Mohammad Hossein Moattar,et al.  A review on speaker diarization systems and approaches , 2012, Speech Commun..

[6]  J. Andrew Bagnell,et al.  Efficient Reductions for Imitation Learning , 2010, AISTATS.

[7]  Jean-Luc Gauvain,et al.  Multistage speaker diarization of broadcast news , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[8]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Mickael Rouvier,et al.  An open-source state-of-the-art toolbox for broadcast news diarization , 2013, INTERSPEECH.

[10]  Olivier Galibert,et al.  The ETAPE corpus for the evaluation of speech-based TV content processing in the French language , 2012, LREC.

[11]  Nicholas W. D. Evans,et al.  Speaker Diarization: A Review of Recent Research , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Daniel Marcu,et al.  Learning as search optimization: approximate large margin methods for structured prediction , 2005, ICML.

[13]  François Yvon,et al.  Structured prediction for speaker identification in TV series , 2015, INTERSPEECH.

[14]  Guillaume Wisniewski,et al.  PanParser: a Modular Implementation for Efficient Transition-Based Dependency Parsing , 2018, Prague Bull. Math. Linguistics.

[15]  Tian Zhang,et al.  BIRCH: A New Data Clustering Algorithm and Its Applications , 1997, Data Mining and Knowledge Discovery.

[16]  Thomas Fillon,et al.  YAAFE, an Easy to Use and Efficient Audio Feature Extraction Software , 2010, ISMIR.

[17]  Brian Roark,et al.  Incremental Parsing with the Perceptron Algorithm , 2004, ACL.

[18]  Shoei Sato,et al.  Low-latency speaker diarization based on Bayesian information criterion with multiple phoneme classes , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Joakim Nivre,et al.  Training Deterministic Parsers with Non-Deterministic Oracles , 2013, TACL.

[20]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[21]  Mauro Cettolo Segmentation, classification and clustering of an Italian broadcast news corpus , 2000 .

[22]  Jean-Luc Gauvain,et al.  Spoken Language Identification Using LSTM-Based Angular Proximity , 2017, INTERSPEECH.

[23]  Yoram Singer,et al.  Pegasos: primal estimated sub-gradient solver for SVM , 2011, Math. Program..

[24]  Satoshi Nakamura,et al.  Improved novelty detection for online GMM based speaker diarization , 2008, INTERSPEECH.

[25]  Olivier Galibert,et al.  The REPERE Corpus : a multimodal corpus for person recognition , 2012, LREC.

[26]  Hervé Bredin,et al.  TristouNet: Triplet loss for speaker turn embedding , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Joakim Nivre,et al.  Algorithms for Deterministic Incremental Dependency Parsing , 2008, CL.

[28]  Jean-Luc Gauvain,et al.  Partitioning and transcription of broadcast news data , 1998, ICSLP.