Paper Information - Collaborative Learning for Language and Speaker Recognition

Collaborative Learning for Language and Speaker Recognition

This paper presents a unified model that performs language and speaker recognition simultaneously. The model is based on a multi-task recurrent neural network in which the output of one task is fed as input to the other, yielding a collaborative learning framework in which language and speaker recognition each improve by borrowing information from the other. Experiments show that the multi-task model outperforms the task-specific models on both tasks.
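The cross-feeding idea in the abstract can be illustrated with a minimal sketch. It is not the paper's architecture: it assumes plain tanh recurrent units, random untrained weights, and that each task receives the other task's posterior from the previous frame. All names here (`TaskCell`, `run_collaborative`, the layer sizes) are hypothetical and chosen only for illustration.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class TaskCell:
    """Hypothetical tanh recurrent cell for one task (assumption, not the paper's unit)."""

    def __init__(self, n_in, n_hid, n_other, n_out, rng):
        self.n_hid, self.n_out = n_hid, n_out
        self.W_x = rand_matrix(n_hid, n_in, rng)     # acoustic input weights
        self.W_h = rand_matrix(n_hid, n_hid, rng)    # own recurrent weights
        self.W_o = rand_matrix(n_hid, n_other, rng)  # weights on the other task's output
        self.W_y = rand_matrix(n_out, n_hid, rng)    # output layer weights

    def step(self, x, h_prev, other_prev):
        """One time step: mix the frame, own state, and the other task's last output."""
        pre = [a + b + c for a, b, c in zip(matvec(self.W_x, x),
                                            matvec(self.W_h, h_prev),
                                            matvec(self.W_o, other_prev))]
        h = [math.tanh(p) for p in pre]
        return h, softmax(matvec(self.W_y, h))


def run_collaborative(frames, lang_cell, spk_cell):
    """Run both tasks over a non-empty frame sequence; return frame-averaged posteriors."""
    h_l, h_s = [0.0] * lang_cell.n_hid, [0.0] * spk_cell.n_hid
    # Start from uniform posteriors before any frame has been seen.
    y_l = [1.0 / lang_cell.n_out] * lang_cell.n_out
    y_s = [1.0 / spk_cell.n_out] * spk_cell.n_out
    lang_sum, spk_sum = [0.0] * lang_cell.n_out, [0.0] * spk_cell.n_out
    for x in frames:
        # Each task sees the other task's output from the previous frame.
        h_l_new, y_l_new = lang_cell.step(x, h_l, y_s)
        h_s_new, y_s_new = spk_cell.step(x, h_s, y_l)
        h_l, y_l, h_s, y_s = h_l_new, y_l_new, h_s_new, y_s_new
        lang_sum = [a + b for a, b in zip(lang_sum, y_l)]
        spk_sum = [a + b for a, b in zip(spk_sum, y_s)]
    n = len(frames)
    return [v / n for v in lang_sum], [v / n for v in spk_sum]


# Toy usage with made-up sizes: 40-dim features, 2 languages, 5 speakers.
rng = random.Random(0)
n_feat, n_lang, n_spk = 40, 2, 5
lang_cell = TaskCell(n_feat, 16, n_spk, n_lang, rng)
spk_cell = TaskCell(n_feat, 16, n_lang, n_spk, rng)
frames = [[rng.gauss(0.0, 1.0) for _ in range(n_feat)] for _ in range(50)]
lang_avg, spk_avg = run_collaborative(frames, lang_cell, spk_cell)
print(lang_avg, spk_avg)
```

Because the weights are random, the printed posteriors carry no meaning; the sketch only shows how information can flow between the two tasks at every frame. In a trained system both cells would presumably be optimized jointly on language and speaker labels.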

Yang Feng | Dong Wang | Shiyue Zhang | Lantian Li | Zhiyuan Tang
