Pyannote.Audio: Neural Building Blocks for Speaker Diarization

We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Based on PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding – reaching state-of-the-art performance for most of them.

[1]  Daniel P. W. Ellis,et al.  librosa/librosa: 0.6.0 , 2018 .

[2]  Delphine Charlet,et al.  Impact of overlapping speech detection on speaker diarization for broadcast news and debates , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[3]  Claude Barras,et al.  Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks , 2017, INTERSPEECH.

[4]  Jean-François Bonastre,et al.  ALIZE, a free toolkit for speaker recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[5]  Jean-Luc Gauvain,et al.  Optimization of RNN-Based Speech Activity Detection , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[6]  Olivier Galibert,et al.  The ETAPE corpus for the evaluation of speech-based TV content processing in the French language , 2012, LREC.

[7]  Yoshua Bengio,et al.  Speaker Recognition from Raw Waveform with SincNet , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[8]  Sanjeev Khudanpur,et al.  Characterizing Performance of Speaker Diarization Systems on Far-Field Speech Using Standard Methods , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Jean Carletta,et al.  Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus , 2007, Lang. Resour. Evaluation.

[10]  Hervé Bredin,et al.  pyannote.metrics: A Toolkit for Reproducible Evaluation, Diagnostic, and Error Analysis of Speaker Diarization Systems , 2017, INTERSPEECH.

[11]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[12]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[13]  Marie Kunesová,et al.  Detection of Overlapping Speech for the Purposes of Speaker Diarization , 2019, SPECOM.

[14]  Kenneth Ward Church,et al.  The Second DIHARD Diarization Challenge: Dataset, task, and baselines , 2019, INTERSPEECH.

[15]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Sylvain Meignier,et al.  S4D: Speaker Diarization Toolkit in Python , 2018, INTERSPEECH.

[17]  Daniel Povey,et al.  MUSAN: A Music, Speech, and Noise Corpus , 2015, ArXiv.

[18]  Theodoros Giannakopoulos pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis , 2015, PloS one.

[19]  Claude Barras,et al.  Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization , 2018, INTERSPEECH.

[20]  Jan Cernocký,et al.  Bayesian HMM Based x-Vector Clustering for Speaker Diarization , 2019, INTERSPEECH.

[21]  Leibny Paola García-Perera,et al.  Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).