Average Window Smoothing for an Indonesian Language Online Speaker Identification System

Online speaker diarization and identification is the task of determining ‘who spoke when’ in an ongoing conversation or audio stream, in contrast to the offline scenario, in which the conversation has concluded and the entire recording is available. Online identification is required when speaker identities must be determined during or directly after speech, for instance in the automatic transcription of live broadcasts and of some meetings. This paper explores the construction of an Indonesian-language online speaker identification system, from design and corpus development to experimentation. The system performs speaker identification directly on segments separated at low-energy points and applies a rolling window of time-weighted average likelihoods to improve accuracy, which gives predictions a latency of one speaker segment. In experiments against a standard offline baseline, the proposed online system reached a speaker error rate (SER) of 25.5%, compared with 18.5% for the offline baseline. The latency of the proposed system is 0.21 times the length of the input segments, compared with 1.10 for the baseline system.
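The rolling-window smoothing step can be illustrated with a minimal sketch. It assumes that "time-weighted" means weighting each segment's per-speaker log-likelihood by the segment's duration, and that the window covers the most recent segments including the current one; the abstract does not specify either detail. The function name `smooth_identify`, its parameters, and the score values are hypothetical and are not taken from the paper.

```python
from collections import deque


def smooth_identify(scores, durations, window=3):
    """Pick a speaker per segment from duration-weighted average scores.

    scores    -- list of dicts mapping speaker -> log-likelihood, one per segment
    durations -- list of segment durations in seconds (assumed weighting)
    window    -- number of most recent segments to average over (assumed)
    """
    recent = deque(maxlen=window)  # holds (segment_scores, duration) pairs
    decisions = []
    for seg_scores, dur in zip(scores, durations):
        recent.append((seg_scores, dur))
        total = sum(d for _, d in recent)
        # Duration-weighted average log-likelihood of each speaker
        # over the segments currently in the window.
        averaged = {
            spk: sum(s[spk] * d for s, d in recent) / total
            for spk in seg_scores
        }
        decisions.append(max(averaged, key=averaged.get))
    return decisions


# Hypothetical scores for two enrolled speakers over three segments.
scores = [
    {"A": -1.0, "B": -2.0},
    {"A": -1.8, "B": -1.7},  # short, ambiguous segment
    {"A": -3.0, "B": -1.0},
]
durations = [4.0, 1.0, 4.0]
print(smooth_identify(scores, durations, window=2))  # ['A', 'A', 'B']
```

In this toy input the short second segment would be labelled B on its own score, but the longer preceding segment pulls the windowed average toward A. Each decision depends only on the current and earlier segments, which is consistent with a latency of one speaker segment.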
