MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification

Motivated by the unconsolidated data situation and the lack of a standard benchmark in the field, we complement our previous efforts and present a comprehensive corpus designed for training and evaluating text-independent multi-channel speaker verification systems. It can also be readily used for experiments with dereverberation, denoising, and speech enhancement. We tackle the ever-present lack of multi-channel training data by simulating multi-channel recordings from clean parts of the VoxCeleb dataset. The development and evaluation trials are based on the retransmitted Voices Obscured in Complex Environmental Settings (VOiCES) corpus, which we modified to provide multi-channel trials. We publish full recipes that create the dataset, named MultiSV, from public sources, and we provide results with two of our multi-channel speaker verification systems with neural-network-based beamforming driven either by predicted ideal binary masks or by the more recent Conv-TasNet.
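
To make the simulation step concrete, below is a minimal sketch of how multi-channel far-field training data can be derived from clean speech: the utterance is convolved with a simulated multi-channel room impulse response (RIR) and mixed with noise at a target SNR. The function name, array shapes, and SNR convention are illustrative assumptions, not the published MultiSV recipe.

```python
# Sketch: turn a clean single-channel utterance into a noisy,
# reverberant multi-channel mixture. Assumes RIRs were generated
# elsewhere (e.g., with an image-method simulator).
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(clean: np.ndarray,
                       rirs: np.ndarray,
                       noise: np.ndarray,
                       snr_db: float) -> np.ndarray:
    """clean: (T,) speech; rirs: (M, L) one RIR per microphone;
    noise: (M, T') multi-channel noise with T' >= T; returns (M, T)."""
    # Reverberate: one convolution per microphone channel,
    # truncated back to the clean-signal length.
    reverbed = np.stack([fftconvolve(clean, rir)[: len(clean)]
                         for rir in rirs])
    noise = noise[:, : reverbed.shape[1]]
    # Scale the noise so that the mixture reaches the requested SNR
    # (powers averaged over all channels).
    speech_pow = np.mean(reverbed ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverbed + gain * noise
```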

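The mask-driven beamforming can likewise be sketched. In the common mask-based MVDR formulation, predicted time-frequency masks select speech- and noise-dominated bins, spatial covariance matrices are accumulated from them per frequency, and MVDR weights follow. The sketch below replaces the neural mask predictor with a toy energy threshold; the function name and STFT settings are illustrative assumptions, not the exact system evaluated in the paper.

```python
# Sketch: mask-based MVDR beamforming. A binary mask (here a toy
# energy threshold standing in for a network prediction) marks
# speech- vs. noise-dominated time-frequency bins; spatial covariance
# matrices estimated from those bins yield the MVDR filter.
import numpy as np
from scipy.signal import stft, istft

def mvdr_from_masks(x: np.ndarray, fs: int = 16000, ref: int = 0) -> np.ndarray:
    """x: (M, T) multi-channel mixture; returns enhanced single channel."""
    _, _, X = stft(x, fs=fs, nperseg=512)          # X: (M, F, T)
    M, F, T = X.shape
    # Toy binary mask: bins above the median log-energy count as speech.
    logE = np.log(np.abs(X[ref]) ** 2 + 1e-12)
    mask_s = (logE > np.median(logE)).astype(float)
    mask_n = 1.0 - mask_s
    Y = np.zeros((F, T), dtype=complex)
    for fi in range(F):
        Xf = X[:, fi, :]                           # (M, T)
        # Masked spatial covariance matrices for speech and noise.
        phi_s = (mask_s[fi] * Xf) @ Xf.conj().T / (mask_s[fi].sum() + 1e-6)
        phi_n = (mask_n[fi] * Xf) @ Xf.conj().T / (mask_n[fi].sum() + 1e-6)
        phi_n += 1e-6 * np.eye(M)                  # regularization
        num = np.linalg.solve(phi_n, phi_s)        # phi_n^{-1} phi_s
        # MVDR weights with a reference-channel selector.
        w = num[:, ref] / (np.trace(num) + 1e-12)
        Y[fi] = w.conj() @ Xf
    _, out = istft(Y, fs=fs, nperseg=512)
    return out
```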