Recurrent Neural Network Based Speaker Change Detection from Text Transcription Applied in Telephone Speaker Diarization System

In this paper, we propose a speaker change detection system based on lexical information from transcribed speech. We apply a recurrent neural network to decide, at the end of each spoken word, whether that word also ends an utterance. Our motivation is to use the transcription of the conversation as an additional feature for a speaker diarization system, in order to refine the segmentation step and achieve better accuracy of the whole diarization system. We compare the proposed transcription-based (text) speaker change detection system with our previous spectrogram-based (audio) system, and we combine the two modalities to improve the diarization results. The conversation is cut into segments at the detected changes, and each segment is represented by an i-vector. We conducted experiments on the English part of the CallHome corpus. When both modalities are used, the results indicate a relative improvement of 0.5% in speaker change detection and of 1% in speaker diarization.
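
The combination of the two modalities and the cutting of the conversation at detected changes can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the text and audio scores are fused, so the weighted average, the `weight` and `threshold` values, and the function names `fuse_scores` and `cut_segments` are all assumptions made for illustration. The sketch presumes one change probability per word from each modality, aligned to the word end times.

```python
from typing import List, Tuple


def fuse_scores(text_probs: List[float],
                audio_probs: List[float],
                weight: float = 0.5) -> List[float]:
    """Combine per-word change probabilities from text and audio.

    A weighted average is an assumed fusion rule, chosen only for
    illustration; `weight` is the share given to the text modality.
    """
    if len(text_probs) != len(audio_probs):
        raise ValueError("both modalities need one score per word")
    return [weight * t + (1.0 - weight) * a
            for t, a in zip(text_probs, audio_probs)]


def cut_segments(word_ends: List[float],
                 scores: List[float],
                 total_duration: float,
                 threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Cut the conversation at word ends whose fused score reaches the threshold.

    Returns (start, end) pairs in seconds that cover the whole conversation.
    """
    boundaries = [end for end, score in zip(word_ends, scores)
                  if score >= threshold and 0.0 < end < total_duration]
    edges = [0.0] + boundaries + [total_duration]
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical example values: four words, with end times in seconds.
word_ends = [0.5, 1.2, 2.0, 2.9]
text_probs = [0.1, 0.9, 0.2, 0.7]   # stand-in for the text-based detector output
audio_probs = [0.2, 0.7, 0.1, 0.2]  # stand-in for the audio-based detector output

fused = fuse_scores(text_probs, audio_probs)
segments = cut_segments(word_ends, fused, total_duration=3.5)
```

In a diarization pipeline of the kind described above, each resulting segment would then be represented by an i-vector before clustering; that step is omitted here because it depends on a trained extractor.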
