13 years of speaker recognition research at BUT, with longitudinal analysis of NIST SRE

Abstract In this paper, we present a brief history and a “longitudinal study” of all the important milestone modelling techniques used in text-independent speaker recognition since Brno University of Technology (BUT) first participated in the NIST Speaker Recognition Evaluation (SRE) in 2006: GMM MAP, GMM MAP with eigen-channel adaptation, Joint Factor Analysis, i-vector, and DNN embedding (x-vector). To emphasize the historical context, the techniques are evaluated on all NIST SRE sets since 2004 on a time-machine principle, i.e. a system is always trained using all data available up to the year of evaluation. Moreover, as user-contributed audiovisual content dominates today's Internet, we also include the Speakers In The Wild (SITW) and VOiCES challenge datasets as representative examples in the evaluation of our systems. We not only compare the modelling techniques but also show the effect of sampling frequency.
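
The time-machine principle can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the corpus labels and release years are hypothetical placeholders, not the actual training sets, and treating "up to the year of evaluation" as inclusive is one reading of the text rather than a confirmed detail of the protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    name: str          # hypothetical label, not a real dataset name
    release_year: int  # year the data became available (made-up values below)


def training_pool(corpora, eval_year):
    """Return the corpora a system evaluated in eval_year may train on.

    Assumption: data released in the evaluation year itself is allowed.
    """
    return [c for c in corpora if c.release_year <= eval_year]


# Hypothetical corpora used only to exercise the selection rule.
corpora = [
    Corpus("corpus_a", 2004),
    Corpus("corpus_b", 2008),
    Corpus("corpus_c", 2016),
]

for year in (2006, 2010, 2018):
    names = [c.name for c in training_pool(corpora, year)]
    print(year, names)
```

Each evaluation year thus gets its own training pool, so a technique is never credited with data that did not yet exist when that evaluation took place.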
