The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016

The 2016 speaker recognition evaluation (SRE'16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE'16 as the result from the collaboration and active exchange of information among researchers from sixteen Institutes and Universities across 4 continents. The joint submission and several of its 32 subsystems were among top-performing systems. A lot of efforts have been devoted to two major challenges, namely, unlabeled training data and dataset shift from Switchboard-Mixer to the new Call My Net dataset. This paper summarizes the lessons learned, presents our shared view from the sixteen research groups on recent advances, major paradigm shift, and common tool chain used in speaker recognition as we have witnessed in SRE'16. More importantly, we look into the intriguing question of fusing a large ensemble of subsystems and the potential benefit of large-scale collaboration .

[1]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[2]  M. Todisco,et al.  Further Optimisations of Constant Q Cepstral Processing for Integrated Utterance Verification and Text-Dependent Speaker Verification , 2016 .

[3]  Themos Stafylakis,et al.  PLDA for speaker verification with utterances of arbitrary duration , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Dong Yu,et al.  Improved Bottleneck Features Using Pretrained Deep Neural Networks , 2011, INTERSPEECH.

[5]  Bin Ma,et al.  Multi-session PLDA scoring of i-vector for partially open-set speaker detection , 2013, INTERSPEECH.

[6]  John H. L. Hansen,et al.  The I4U Submission to the 2012 NIST Speaker Recognition Evaluation , 2012 .

[7]  Themos Stafylakis,et al.  JFA-based front ends for speaker recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  John H. L. Hansen,et al.  I4u submission to NIST SRE 2012: a large-scale collaborative effort for noise-robust speaker verification , 2013, INTERSPEECH.

[9]  Patrick Kenny,et al.  Bayesian Speaker Verification with Heavy-Tailed Priors , 2010, Odyssey.

[10]  John S. Garofolo,et al.  NIST Speech Processing Evaluations: LVCSR, Speaker Recognition, Language Recognition , 2007 .

[11]  Yun Lei,et al.  A novel scheme for speaker recognition using a phonetically-aware deep neural network , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Douglas A. Reynolds,et al.  Deep Neural Network Approaches to Speaker and Language Recognition , 2015, IEEE Signal Processing Letters.

[13]  Haizhou Li,et al.  ALIZE 3.0 - open source toolkit for state-of-the-art speaker recognition , 2013, INTERSPEECH.

[14]  Richard M. Stern,et al.  Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[15]  James H. Elder,et al.  Probabilistic Linear Discriminant Analysis for Inferences About Identity , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[16]  Gerhard Tutz,et al.  The OSCAR for Generalized Linear Models , 2011 .

[17]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[18]  Hagai Aronowitz Compensating Inter-Dataset Variability in PLDA Hyper-Parameters for Robust Speaker Recognition , 2014, Odyssey.

[19]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[20]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[21]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[22]  Kong-Aik Lee,et al.  An extensible speaker identification sidekit in Python , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Chng Eng Siong,et al.  A comparative study of BNF and DNN multilingual training on cross-lingual low-resource speech recognition , 2015, INTERSPEECH.

[24]  Tomi Kinnunen,et al.  Further optimisations of constant Q cepstral processing for integrated utterance and text-dependent speaker verification , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[25]  Themos Stafylakis,et al.  Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition , 2014, Odyssey.

[26]  John Salvatier,et al.  Theano: A Python framework for fast computation of mathematical expressions , 2016, ArXiv.

[27]  John H. L. Hansen,et al.  i-Vector/PLDA speaker recognition using support vectors with discriminant analysis , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28]  Larry P. Heck,et al.  MSR Identity Toolbox v1.0: A MATLAB Toolbox for Speaker Recognition Research , 2013 .

[29]  Martin Karafiát,et al.  Adaptation of multilingual stacked bottle-neck neural network structure for new language , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Man-Wai Mak,et al.  A study of voice activity detection techniques for NIST speaker recognition evaluations , 2014, Comput. Speech Lang..

[31]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[32]  Tomi Kinnunen,et al.  A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[33]  Bin Ma,et al.  The 2015 NIST Language Recognition Evaluation: The Shared View of I2R, Fantastic4 and SingaMS , 2016, INTERSPEECH.

[34]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[35]  Daniel P. W. Ellis,et al.  Tandem connectionist feature extraction for conventional HMM systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[36]  Jan Cernocký,et al.  Probabilistic and Bottle-Neck Features for LVCSR of Meetings , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[37]  Sébastien Marcel,et al.  Bob: a free signal processing and machine learning toolbox for researchers , 2012, ACM Multimedia.

[38]  Daniel Garcia-Romero,et al.  Time delay deep neural network-based universal background models for speaker recognition , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).