Paper information - BUT Text-Dependent Speaker Verification System for SdSV Challenge 2020

BUT Text-Dependent Speaker Verification System for SdSV Challenge 2020

In this paper, we present the winning BUT submission for the text-dependent task of the SdSV Challenge 2020. Given the large amount of training data available in this challenge, we explore techniques that have been successful in text-independent systems and apply them to the text-dependent scenario. In particular, we train x-vector extractors on both in-domain and out-of-domain datasets and combine them with i-vectors trained on concatenated MFCCs and bottleneck features, which have proven effective for the text-dependent scenario. Moreover, we propose a phrase-dependent PLDA backend for scoring and combine it with a simple phrase recognizer, which brings up to 63% relative improvement on our development set over standard PLDA. Finally, we combine our i-vector and x-vector based systems using a simple linear logistic regression score-level fusion, which provides 28% relative improvement on the evaluation set over our best single system.
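The abstract does not describe how the score-level fusion is implemented, so the following is only a minimal sketch of linear logistic regression fusion in general. It assumes each trial has one score per subsystem and a binary label (1 for target, 0 for non-target), and it fits the fusion weights with plain batch gradient descent. The function names `train_fusion` and `fuse`, the learning rate, the epoch count, and the synthetic scores are all illustrative assumptions, not the authors' setup.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse(w, b, s):
    # Fused score: affine combination of the subsystem scores.
    return b + sum(wi * si for wi, si in zip(w, s))


def train_fusion(scores, labels, lr=0.1, epochs=500):
    """Fit fusion weights by minimizing the logistic (cross-entropy) loss.

    scores: list of per-trial score vectors, one score per subsystem
    labels: list of 1 (target trial) or 0 (non-target trial)
    """
    n_sys = len(scores[0])
    n = len(scores)
    w = [0.0] * n_sys
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_sys
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(fuse(w, b, s)) - y  # gradient of the loss w.r.t. the logit
            for i in range(n_sys):
                grad_w[i] += err * s[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic development trials (assumption): two subsystems whose scores are
# Gaussian around +1 for target trials and around -1 for non-target trials.
random.seed(0)
dev_scores, dev_labels = [], []
for _ in range(200):
    dev_scores.append([random.gauss(1.0, 1.0), random.gauss(1.0, 1.0)])
    dev_labels.append(1)
    dev_scores.append([random.gauss(-1.0, 1.0), random.gauss(-1.0, 1.0)])
    dev_labels.append(0)

w, b = train_fusion(dev_scores, dev_labels)
fused_high = fuse(w, b, [2.0, 2.0])    # both subsystems favor "target"
fused_low = fuse(w, b, [-2.0, -2.0])   # both subsystems favor "non-target"
```

In practice the weights would be trained on held-out development trials and then applied unchanged to evaluation trials; a toolkit implementation of logistic regression could replace the hand-written gradient descent used here.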
