Speaker Diarization through Waveform Analysis and Neural Networks

This paper presents an approach to the speaker diarization problem based on local analysis of the speech waveform. We assume that the recorded sound scene consists of a known number of sources and that a single microphone is used for recording. The research goal is to develop an algorithm for speaker diarization in online mode. Particular attention is paid to limiting the computational resources required to solve the problem. We suppose that the speech file is already segmented so that each segment belongs to a single speaker. Our method is as follows. We divide each segment into non-overlapping fragments of constant length and replace every sample in a fragment with its absolute value. A dedicated technique is used to choose a threshold value Thr. We then select the portions of each fragment that exceed Thr and encode the revealed parts of the source signal as normalized cumulative sums, each containing the same number of items. These sums are used as input vectors for two types of neural networks. For comparison, we also developed a simple baseline algorithm that solves the problem without a neural network. Our experiments show that end-to-end neural classification of the fragments yields acceptable results.
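As an illustration of the encoding step, the sketch below rectifies a fixed-length fragment, keeps the samples exceeding a given Thr, and describes them as a normalized cumulative sum of fixed length. The threshold is assumed to be supplied (the paper's own selection technique is not reproduced here), and the vector length `n_items` and the interpolation-based resampling are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def encode_fragment(fragment: np.ndarray, thr: float, n_items: int = 32) -> np.ndarray:
    """Encode one fixed-length fragment as a normalized cumulative sum.

    `thr` is assumed given; the paper's threshold-selection technique
    is not reproduced here. `n_items` is an illustrative parameter.
    """
    # Replace every sample in the fragment with its absolute value.
    rectified = np.abs(fragment)
    # Keep only the portions that exceed the threshold Thr.
    above = rectified[rectified > thr]
    if above.size == 0:
        return np.zeros(n_items)
    # Cumulative sum of the selected samples.
    csum = np.cumsum(above)
    # Resample to a fixed number of items so every fragment yields a
    # vector of the same dimension (interpolation is a placeholder for
    # whatever resampling the paper actually uses).
    idx = np.linspace(0, above.size - 1, n_items)
    resampled = np.interp(idx, np.arange(above.size), csum)
    # Normalize so the vector ends at 1, making the description
    # invariant to the overall signal amplitude.
    return resampled / csum[-1]

# Example: encode a 1024-sample synthetic fragment.
rng = np.random.default_rng(0)
fragment = rng.normal(scale=0.1, size=1024)
vec = encode_fragment(fragment, thr=0.15)
print(vec.shape)  # (32,)
```

Vectors produced this way have a common, small dimension regardless of how many samples exceed Thr, which keeps the downstream neural networks compact and supports the paper's goal of limited computational cost.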