Paper information - Pyannote.Audio: Neural Building Blocks for Speaker Diarization

We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Built on the PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them.
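
To make the idea of a diarization building block more concrete, here is a minimal sketch of one post-processing step that usually follows a voice activity detection model: turning frame-level speech scores into time segments. It is not the pyannote.audio API. The function name `scores_to_segments`, the parameters `frame_duration` and `threshold`, and the input scores are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def scores_to_segments(
    scores: Sequence[float],
    frame_duration: float = 1.0,
    threshold: float = 0.5,
) -> List[Tuple[float, float]]:
    """Group consecutive frames scored at or above `threshold` into (start, end) segments.

    Hypothetical helper for illustration; not part of pyannote.audio.
    """
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current speech run, if any
    for i, score in enumerate(scores):
        is_speech = score >= threshold
        if is_speech and start is None:
            start = i  # a speech run begins at this frame
        elif not is_speech and start is not None:
            # the speech run ended just before this frame
            segments.append((start * frame_duration, i * frame_duration))
            start = None
    if start is not None:
        # close a run that lasts until the final frame
        segments.append((start * frame_duration, len(scores) * frame_duration))
    return segments


# Made-up scores, one per 1-second frame (an assumption for this example).
example_scores = [0.1, 0.9, 0.8, 0.2, 0.7]
example_segments = scores_to_segments(example_scores)
```

In a full pipeline, the scores would come from a trained neural model, and segments like these would then be passed on to later stages such as speaker change detection and speaker embedding.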