Adversarial Multi-task Deep Features and Unsupervised Back-end Adaptation for Language Recognition

This paper investigates speaker-invariant feature learning and domain adaptation for language recognition (LR) with short utterances. While following the conventional design of an i-vector front-end and a probabilistic linear discriminant analysis (PLDA) back-end, we propose two extensions: speaker adversarial multi-task learning (AMTL), which explicitly targets speaker-invariant multilingual bottleneck features, and unsupervised PLDA adaptation, which alleviates the performance degradation caused by domain mismatch between training and test data. A demonstration experiment shows the adverse effect of domain mismatch and motivates the need for domain adaptation. LR experiments are carried out on the AP17-OLR challenge dataset to evaluate the proposed methods against the state of the art. The results show that both speaker AMTL and unsupervised PLDA adaptation contribute significantly to performance improvement on the short-duration LR task. The effectiveness of PLDA adaptation is found to be insensitive to the number of clusters assumed in unsupervised data labeling. Our best system outperforms the state-of-the-art system of AP17-OLR, with relative improvements of 6.98% in Cavg and 4.80% in EER on the 1-second test set.
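The unsupervised data labeling step mentioned above can be illustrated with a small sketch. The abstract does not specify the clustering algorithm or how the PLDA parameters are updated, so everything here is an assumption made for illustration: plain k-means with farthest-first initialisation stands in for the labeling step, toy 2-dimensional points stand in for i-vectors, and a linear interpolation weight `alpha` between in-domain and out-of-domain within-class covariances stands in for the adaptation. All function names are hypothetical.

```python
# Illustrative sketch only: the clustering method, the interpolation rule and
# all names below are assumptions, not the method described in the paper.

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(points, k, iters=10):
    """Assign a pseudo-label to each unlabeled vector with plain k-means."""
    # Deterministic farthest-first initialisation of the k centers.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def within_class_cov(points, labels):
    """Within-class covariance computed from (pseudo-)labels."""
    dim = len(points[0])
    means = {}
    for l in set(labels):
        members = [p for p, m in zip(points, labels) if m == l]
        means[l] = [sum(col) / len(members) for col in zip(*members)]
    cov = [[0.0] * dim for _ in range(dim)]
    for p, l in zip(points, labels):
        d = [x - m for x, m in zip(p, means[l])]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += d[i] * d[j] / len(points)
    return cov

def interpolate(in_cov, out_cov, alpha):
    """Blend in-domain and out-of-domain covariances (assumed adaptation rule)."""
    return [[alpha * a + (1 - alpha) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(in_cov, out_cov)]

# Toy unlabeled in-domain vectors forming two well-separated groups.
in_domain = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = kmeans_labels(in_domain, k=2)
in_cov = within_class_cov(in_domain, labels)

# Hypothetical within-class covariance estimated on labeled out-of-domain data.
out_cov = [[1.0, 0.0], [0.0, 3.0]]
adapted = interpolate(in_cov, out_cov, alpha=0.5)
```

In this toy setting `k` plays the role of the assumed number of clusters. Rerunning the sketch with different values of `k` shows how the pseudo-labels, and therefore the adapted covariance, depend on that choice; the real system's insensitivity to it is an empirical finding reported in the abstract, not something this sketch demonstrates.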
