The JHU-MIT System Description for NIST SRE18

This document is the SRE18 system description for the joint effort of the teams at JHU-CLSP, JHU-HLTCOE, MIT Lincoln Laboratory, MIT CSAIL, and LSE-EPITA. All of the developed systems consisted of neural-network or i-vector embeddings with some flavor of PLDA back-end, and each was tailored to either the video (VAST) condition or the telephone (CMN2) condition. For VAST, the primary system was a fusion of a 16 kHz TDNN x-vector, a 16 kHz factorized TDNN x-vector, an 8 kHz TDNN x-vector, and an 8 kHz ResNet34-Attention embedding. For CMN2, the primary system was a fusion of two TDNN x-vectors and a ResNet34-Attention embedding. For development in the VAST condition, we used the SITW eval core-multi dataset, on which we obtained Cprimary = 0.105; for the telephone condition, we used the SRE18 dev CMN2 set, on which we obtained Cprimary = 0.256. The contrastive submissions included the best single system (JHU-HLTCOE, SITW Cprimary = 0.137, CMN2 Cprimary = 0.312) and the best fusions of 1, 2, 3, ... systems from the JHU-CLSP-MIT sub-team.
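
Both primary submissions are score-level fusions of several embedding systems. As a concrete illustration, the sketch below shows one standard way such a fusion is trained: linear logistic regression over per-system scores on a development set (in the spirit of tools such as the BOSARIS toolkit). The file names, the four-system layout, and the use of scikit-learn are assumptions for illustration, not this submission's actual fusion recipe.

```python
# A minimal sketch of linear score-level fusion, assuming one score file per
# system (one score per line, same trial order) and 0/1 target labels for a
# development set. File names and the scikit-learn fuser are illustrative
# assumptions, not the actual recipe from this submission.
import numpy as np
from sklearn.linear_model import LogisticRegression

DEV_FILES = ["tdnn_16k.dev", "ftdnn_16k.dev", "tdnn_8k.dev", "resnet34.dev"]
EVAL_FILES = ["tdnn_16k.eval", "ftdnn_16k.eval", "tdnn_8k.eval", "resnet34.eval"]

# Stack per-system scores into an (n_trials, n_systems) matrix.
dev_scores = np.column_stack([np.loadtxt(f) for f in DEV_FILES])
dev_labels = np.loadtxt("dev_labels.txt")  # 1 = target trial, 0 = non-target

# Learn one weight per system plus an offset on the development trials.
fuser = LogisticRegression()
fuser.fit(dev_scores, dev_labels)

# The decision function is the fused score: a weighted sum of system scores
# plus an offset, which (given matched dev/eval conditions) approximates a
# calibrated log-likelihood ratio.
eval_scores = np.column_stack([np.loadtxt(f) for f in EVAL_FILES])
np.savetxt("fused_scores.txt", fuser.decision_function(eval_scores))
```

A logistic-regression fuser of this kind also calibrates its output, so the fused score can be thresholded directly at the operating points that define Cprimary.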
