Recursive Whitening Transformation for Speaker Recognition on Language Mismatched Condition

Recently in speaker recognition, performance degradation due to channel-domain mismatch has been actively addressed. However, the mismatch arising from language has yet to be sufficiently addressed. This paper proposes an approach that employs a recursive whitening transformation to mitigate the language mismatched condition. The proposed method is based on multiple whitening transformations, which are intended to remove un-whitened residual components in the dataset associated with i-vector length normalization. The experiments were conducted on the Speaker Recognition Evaluation 2016 trials, in which the task is non-English speaker recognition using a development dataset consisting of both a large-scale out-of-domain (English) dataset and an extremely low-quantity in-domain (non-English) dataset. For performance comparison, we develop a state-of-the-art system using a deep neural network and bottleneck features, based on a phonetically aware model. The experimental results, along with other prior studies, validate the effectiveness of the proposed method under the language mismatched condition.
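The core idea can be illustrated with a minimal sketch: since i-vector length normalization projects whitened vectors onto the unit sphere, it reintroduces residual un-whitened components, so whitening is re-estimated and reapplied over several passes. The sketch below is an assumption-laden illustration of that alternation (function names, the iteration count, and the plain PCA-style whitener are illustrative, not the authors' exact recipe):

```python
import numpy as np

def estimate_whitener(X, eps=1e-8):
    """Estimate a zero-mean, identity-covariance (ZCA-style) whitening
    transform from the rows of X (shape N x D)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / len(X)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return mu, W

def recursive_whitening(X, n_iters=3):
    """Alternate whitening with length normalization; each pass targets
    the residual un-whitened components left by the previous norm step."""
    transforms = []
    for _ in range(n_iters):
        mu, W = estimate_whitener(X)
        transforms.append((mu, W))
        X = (X - mu) @ W
        # i-vector length normalization: project onto the unit sphere
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X, transforms
```

The stored `(mu, W)` pairs would then be applied in the same order to enrollment and test i-vectors before PLDA scoring.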
