Insights into End-to-End Learning Scheme for Language Identification

A novel interpretable end-to-end learning scheme for language identification is proposed. It is in line with the classical GMM i-vector methods both theoretically and practically. In the end-to-end pipeline, a general encoding layer is employed on top of the front-end CNN, so that it automatically encodes a variable-length input sequence into an utterance-level vector. After comparing with state-of-the-art GMM i-vector methods, we give insights into the CNN and reveal its role and effect in the whole pipeline. We then introduce the general encoding layer and explain why it may be appropriate for language identification. We elaborate on several typical encoding layers, including a temporal average pooling layer, a recurrent encoding layer, and a novel learnable dictionary encoding layer. We conducted experiments on the NIST LRE07 closed-set task, and the results show that the proposed end-to-end systems achieve state-of-the-art performance.
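To make the described pipeline concrete, below is a minimal sketch in PyTorch of a front-end CNN followed by a temporal average pooling encoding layer that maps a variable-length feature sequence to a single utterance-level vector, then a language classifier. The layer sizes, kernel widths, and class names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (PyTorch) of the described pipeline: a front-end CNN over
# frame-level features, a temporal average pooling encoding layer that maps a
# variable-length sequence to one utterance-level vector, and a language
# classifier. Layer sizes and names are illustrative, not the paper's setup.
import torch
import torch.nn as nn


class TemporalAveragePooling(nn.Module):
    """Encode (batch, channels, time) frame features into (batch, channels)."""
    def forward(self, x):
        return x.mean(dim=2)


class LIDNet(nn.Module):
    def __init__(self, feat_dim=64, num_languages=14):
        super().__init__()
        # Front-end CNN operating on acoustic feature sequences of any length.
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
        )
        # Encoding layer: collapses the time axis into a fixed-size vector.
        self.encoder = TemporalAveragePooling()
        # Utterance-level classifier over the language classes.
        self.classifier = nn.Linear(256, num_languages)

    def forward(self, x):
        # x: (batch, feat_dim, time); time may vary between utterances.
        frame_feats = self.cnn(x)
        utterance_vec = self.encoder(frame_feats)
        return self.classifier(utterance_vec)


if __name__ == "__main__":
    model = LIDNet()
    # Inputs with different lengths yield the same utterance-level output shape.
    for length in (200, 350):
        logits = model(torch.randn(4, 64, length))
        print(logits.shape)  # torch.Size([4, 14])
```

Swapping `TemporalAveragePooling` for a recurrent encoder or a learnable dictionary encoding layer only changes the encoding module; the rest of the pipeline stays the same.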
