Information Bottleneck Based Percussion Instrument Diarization System for Taniavartanam Segments of Carnatic Music Concerts

An approach to diarizing the taniavartanam segments of a Carnatic music concert is proposed in this paper. The information bottleneck (IB) approach used for speaker diarization is applied to this task. The IB system initializes the segments to be clustered uniformly, with a fixed duration. The difficulty in diarizing percussion instruments in taniavartanam is that the stroke rate varies widely across segments: it can double or even quadruple within a short span, leading to a variable information rate in different segments. To address this, the IB system is modified to use stroke rate information to divide the audio into segments of varying duration. These variable-duration segments are then clustered using the IB approach, followed by a Kullback-Leibler hidden Markov model (KLHMM) based realignment of the instrument boundaries. The performance of the conventional IB system and the proposed system is evaluated on a standard Carnatic music dataset. The proposed technique shows a best-case absolute improvement of 8.2% in diarization error rate over the conventional IB-based system.
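The core modification, variable-duration segmentation driven by the local stroke rate, can be sketched as below. This is a minimal illustration under assumptions of my own: the `stroke_rate_segments` helper, its `strokes_per_segment` parameter, and the synthetic onset times are hypothetical and do not reproduce the paper's actual implementation, which is not detailed in the abstract.

```python
def stroke_rate_segments(onset_times, strokes_per_segment=8):
    """Partition a recording into variable-duration segments, each
    covering roughly the same number of percussion strokes, so that
    every segment carries a comparable amount of information even
    when the stroke rate doubles or quadruples locally.

    onset_times: sorted list of detected stroke onsets in seconds.
    Returns a list of (start, end) segment boundaries in seconds.
    """
    segments = []
    for i in range(0, len(onset_times) - 1, strokes_per_segment):
        start = onset_times[i]
        end_idx = min(i + strokes_per_segment, len(onset_times) - 1)
        segments.append((start, onset_times[end_idx]))
    return segments

# Synthetic example: the stroke rate doubles halfway through
# (one stroke every 0.25 s, then one every 0.125 s).
slow = [i * 0.25 for i in range(32)]          # 0.00 .. 7.75 s
fast = [8.0 + i * 0.125 for i in range(32)]   # 8.00 .. 11.875 s
segs = stroke_rate_segments(slow + fast)
# Segments in the fast region span half the duration of those in the
# slow region, while each contains the same number of strokes.
```

In contrast, a conventional IB initialization would cut the same audio into fixed-duration chunks, so chunks in the fast region would contain twice as many strokes as those in the slow region.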
