Paper Information - The DKU System for the Speaker Recognition Task of the 2019 VOiCES from a Distance Challenge

In this paper, we present the DKU system for the speaker recognition task of the 2019 VOiCES from a Distance Challenge. We investigate the whole system pipeline for far-field speaker verification, including data pre-processing, short-term spectral feature representation, utterance-level speaker modeling, back-end scoring, and score normalization. Our best single system employs a residual neural network trained with angular softmax loss, and applying the weighted prediction error (WPE) algorithm further improves its performance. This single system achieves 0.3668 minDCF and 5.58% EER on the evaluation set using simple cosine similarity scoring. Finally, the submitted primary system obtains 0.3532 minDCF and 4.96% EER on the evaluation set.
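To make the back-end scoring and the EER metric in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes speaker embeddings are plain lists of floats, and the function names `cosine_score` and `equal_error_rate` are hypothetical. The EER here is approximated by sweeping thresholds over the observed scores; evaluation toolkits typically interpolate the error curves instead.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where miss and false-alarm rates meet.

    A trial is accepted when its score is >= the threshold. Every observed
    score is tried as a threshold, and the point with the smallest gap
    between the two error rates is reported as their average.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Toy trial list (made-up scores, for illustration only).
targets = [0.9, 0.8, 0.3]
nontargets = [0.4, 0.2, 0.1]
toy_eer = equal_error_rate(targets, nontargets)
```

In a verification trial, `cosine_score` would be applied to the enrollment and test embeddings, and the resulting scores for target and non-target trials would be passed to `equal_error_rate`.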
