A PLDA approach for language and text independent speaker recognition

A language-independent PLDA training algorithm is proposed to improve the performance of text-independent speaker recognition under the multilingual trial condition. The proposed approach takes advantage of multilingual utterances from bilingual speakers, and it provides significant improvement for non-English trials, making it an effective technique for adapting a speaker recognition system to a low-resource language. Many factors affect the variability of an i-vector extracted from a speech segment, such as the acoustic content, segment duration, handset type and background noise. The language being spoken is one source of variation that has received limited attention, owing to the scarcity of available multilingual resources; consequently, discrimination performance is much lower under the multilingual trial condition. Standard session-compensation techniques such as Within-Class Covariance Normalization (WCCN), Linear Discriminant Analysis (LDA) and Probabilistic LDA (PLDA) cannot robustly compensate for language as a source of variation when the available data are too limited to represent such variability. The source normalization technique, developed to compensate for speech-source variation, offered superior performance in cross-language trials by providing a better estimate of the within-speaker scatter matrix in the WCCN and LDA techniques.
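As an illustrative sketch of the compensation step (not the paper's implementation; function names and shapes are assumptions), WCCN estimates the within-speaker scatter from speaker-labelled i-vectors and whitens it; source normalization differs only in how that scatter matrix is estimated:

```python
import numpy as np

def within_speaker_scatter(ivectors, speaker_ids):
    """Within-speaker scatter: average covariance of i-vectors
    around their own speaker's mean."""
    speakers = np.unique(speaker_ids)
    dim = ivectors.shape[1]
    scatter = np.zeros((dim, dim))
    for spk in speakers:
        x = ivectors[speaker_ids == spk]
        centered = x - x.mean(axis=0)
        scatter += centered.T @ centered / len(x)
    return scatter / len(speakers)

def wccn_projection(ivectors, speaker_ids):
    """WCCN projection B with B @ B.T = W^-1, so the projected
    i-vectors (x @ B) have identity within-speaker covariance."""
    W = within_speaker_scatter(ivectors, speaker_ids)
    return np.linalg.cholesky(np.linalg.inv(W))
```

A source-normalized variant would pool the scatter over language or source subsets before inverting, so that sparsely represented languages still contribute to the within-speaker estimate.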
However, neither language normalization nor the state-of-the-art PLDA algorithm can model language variability on a dataset with insufficient multilingual utterances per speaker, resulting in poor performance in the cross-language trial condition. This study extends our initial development of a language-independent PLDA training algorithm aimed at reducing the effect of language as a source of variability on speaker recognition performance. We provide a thorough analysis of how the proposed approach can utilize multilingual training data from bilingual speakers to robustly compensate for the effect of language. Evaluated on the multilingual trial condition, the proposed solution demonstrated relative improvements of over 10% in EER and 13% in minimum DCF on the NIST 2008 speaker recognition evaluation, as well as 12.4% in EER and 23% in minimum DCF on the PRISM evaluation set, over the baseline system, while also providing improvement in other trial conditions.
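The PLDA back-end scores a trial as a same-speaker versus different-speaker likelihood ratio. The following is a minimal two-covariance Gaussian sketch of that score, a simplification of the full model trained in the paper; all names are illustrative:

```python
import numpy as np

def gauss_logpdf(x, cov):
    """Log-density of a zero-mean Gaussian evaluated at x."""
    d = len(x)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(cov, x))

def plda_llr(x1, x2, Sigma_b, Sigma_w):
    """Two-covariance PLDA score: log p(x1, x2 | same speaker)
    minus log p(x1) p(x2) for independent speakers."""
    T = Sigma_b + Sigma_w                           # total covariance
    joint = np.block([[T, Sigma_b], [Sigma_b, T]])  # shared speaker factor
    same = gauss_logpdf(np.concatenate([x1, x2]), joint)
    diff = gauss_logpdf(x1, T) + gauss_logpdf(x2, T)
    return same - diff
```

In a language-independent training scheme of the kind described above, multilingual utterances from bilingual speakers share one speaker label, so language shift is absorbed into the within-speaker term (`Sigma_w` here) rather than inflating the between-speaker term.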

[1]  Christopher Cieri,et al.  Resources for new research directions in speaker recognition: the mixer 3, 4 and 5 corpora , 2007, INTERSPEECH.

[2]  Patrick Kenny,et al.  Bayesian Speaker Verification with Heavy-Tailed Priors , 2010, Odyssey.

[3]  Lukás Burget,et al.  Analysis of DNN approaches to speaker identification , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Simon Dobrisek,et al.  Incorporating Duration Information into I-Vector-Based Speaker Recognition Systems , 2014, Odyssey.

[5]  Andreas Stolcke,et al.  Within-class covariance normalization for SVM-based speaker recognition , 2006, INTERSPEECH.

[6]  Andreas Stolcke,et al.  Multispeaker speech activity detection for the ICSI meeting recorder , 2001, IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01..

[7]  Sridha Sridharan,et al.  Improving out-domain PLDA speaker verification using unsupervised inter-dataset variability compensation approach , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Niko Brümmer,et al.  Unsupervised Domain Adaptation for I-Vector Speaker Recognition , 2014, Odyssey.

[9]  Yun Lei,et al.  A novel scheme for speaker recognition using a phonetically-aware deep neural network , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  David A. van Leeuwen,et al.  Source-Normalized LDA for Robust Speaker Recognition Using i-Vectors From Multiple Speech Sources , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  Daniel Garcia-Romero,et al.  Analysis of i-vector Length Normalization in Speaker Recognition Systems , 2011, INTERSPEECH.

[12]  David A. van Leeuwen,et al.  Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[14]  Sébastien Marcel,et al.  Hierarchical speaker clustering methods for the NIST i-vector Challenge , 2014, Odyssey.

[15]  Mohammad Mehdi Homayounpour,et al.  Linearly Constrained Minimum Variance for Robust I-vector Based Speaker Recognition , 2014, Odyssey.

[16]  Alan McCree,et al.  Supervised domain adaptation for I-vector based speaker recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  James H. Elder,et al.  Probabilistic Linear Discriminant Analysis for Inferences About Identity , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[18]  Spyridon Matsoukas,et al.  Domain adaptation via within-class covariance correction in I-vector based speaker recognition systems , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Liang Lu,et al.  The effect of language factors for robust speaker recognition , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[20]  Hagai Aronowitz,et al.  Inter dataset variability compensation for speaker recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Jean-Pierre Martens,et al.  Combining Joint Factor Analysis and iVectors for Robust Language Recognition , 2014, Odyssey.

[22]  David A. van Leeuwen,et al.  Source normalization for language-independent speaker recognition using i-vectors , 2012, Odyssey.

[23]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[24]  Florin Curelaru,et al.  Front-End Factor Analysis For Speaker Verification , 2018, 2018 International Conference on Communications (COMM).

[25]  Alvin F. Martin,et al.  NIST 2008 speaker recognition evaluation: performance across telephone and room microphone channels , 2009, INTERSPEECH.

[26]  Mark Liberman,et al.  Speech activity detection on youtube using deep neural networks , 2013, INTERSPEECH.

[27]  Gérard Chollet,et al.  A PLDA Approach for Language and Text Independent Speaker Recognition , 2016, Odyssey.

[28]  Nima Mesgarani,et al.  Discrimination of speech from nonspeech based on multiscale spectro-temporal Modulations , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[29]  Liang He,et al.  Investigation of bottleneck features and multilingual deep neural networks for speaker verification , 2015, INTERSPEECH.

[30]  Lukás Burget,et al.  Analysis and Optimization of Bottleneck Features for Speaker Recognition , 2016, Odyssey.

[31]  Themos Stafylakis,et al.  Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition , 2014, Odyssey.

[32]  Douglas A. Reynolds,et al.  The NIST 2014 Speaker Recognition i-vector Machine Learning Challenge , 2014, Odyssey.

[33]  John H. L. Hansen,et al.  Spoken language mismatch in speaker verification: An investigation with NIST-SRE and CRSS Bi-Ling corpora , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[34]  Spyridon Matsoukas,et al.  Developing a Speech Activity Detection System for the DARPA RATS Program , 2012, INTERSPEECH.

[35]  Sergey Novoselov,et al.  STC Speaker Recognition System for the NIST i-Vector Challenge , 2014, Odyssey.