The JHU-MIT System Description for NIST SRE18

This document is the SRE18 system description for the joint effort of the teams at JHU-CLSP, JHU-HLTCOE, MIT Lincoln Laboratory, MIT CSAIL, and LSE-EPITA. All of the developed systems consisted of neural-network or i-vector embeddings with some flavor of PLDA back-end, and each was tailored to either the video (VAST) condition or the telephone (CMN2) condition. For VAST, the primary system was a fusion of a 16 kHz TDNN x-vector, a 16 kHz factorized TDNN x-vector, an 8 kHz TDNN x-vector, and an 8 kHz ResNet34-Attention embedding. For CMN2, the primary system was a fusion of two TDNN x-vectors and a ResNet34-Attention embedding. For development in the VAST condition, we used the SITW eval core-multi dataset, on which we obtained Cprimary = 0.105; for the telephone condition, we used the SRE18 dev CMN2 set, on which we obtained Cprimary = 0.256. The contrastive submissions included the best single system (JHU-HLTCOE, SITW Cprimary = 0.137, CMN2 Cprimary = 0.312) and the best fusions of 1, 2, 3, ... systems from the JHU-CLSP-MIT sub-team.
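
Both primary submissions are score-level fusions of several embedding systems. As a concrete illustration, the sketch below shows one standard way such a fusion is trained: linear logistic regression over per-system scores on a development set (in the spirit of tools such as the BOSARIS toolkit). The file names, the four-system layout, and the use of scikit-learn are assumptions for illustration, not this submission's actual fusion recipe.

```python
# A minimal sketch of linear score-level fusion, assuming one score file per
# system (one score per line, same trial order) and 0/1 target labels for a
# development set. File names and the scikit-learn fuser are illustrative
# assumptions, not the actual recipe from this submission.
import numpy as np
from sklearn.linear_model import LogisticRegression

DEV_FILES = ["tdnn_16k.dev", "ftdnn_16k.dev", "tdnn_8k.dev", "resnet34.dev"]
EVAL_FILES = ["tdnn_16k.eval", "ftdnn_16k.eval", "tdnn_8k.eval", "resnet34.eval"]

# Stack per-system scores into an (n_trials, n_systems) matrix.
dev_scores = np.column_stack([np.loadtxt(f) for f in DEV_FILES])
dev_labels = np.loadtxt("dev_labels.txt")  # 1 = target trial, 0 = non-target

# Learn one weight per system plus an offset on the development trials.
fuser = LogisticRegression()
fuser.fit(dev_scores, dev_labels)

# The decision function is the fused score: a weighted sum of system scores
# plus an offset, which (given matched dev/eval conditions) approximates a
# calibrated log-likelihood ratio.
eval_scores = np.column_stack([np.loadtxt(f) for f in EVAL_FILES])
np.savetxt("fused_scores.txt", fuser.decision_function(eval_scores))
```

A logistic-regression fuser of this kind also calibrates its output, so the fused score can be thresholded directly at the operating points that define Cprimary.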
