Improvements on Deep Bottleneck Network based I-Vector Representation for Spoken Language Identification

Recently, the i-vector representation based on deep bottleneck networks (DBN) pre-trained for automatic speech recognition has received significant interest for both speaker verification (SV) and language identification (LID). In particular, a recent unified DBN based i-vector framework, referred to as DBN-pGMM i-vector, has performed well. In this paper, we replace the pGMM with a phonetic mixture of factor analyzers (pMFA) and propose a new DBN-pMFA i-vector. The DBN-pMFA i-vector includes the following improvements: (i) a pMFA model is derived from the DBN, which jointly performs feature dimension reduction and de-correlation in a single linear transformation; (ii) a shifted deep bottleneck feature (DBF), termed SDBF, is proposed to exploit temporal contextual information; (iii) a senone selection scheme is proposed to make i-vector extraction more efficient. We evaluate the proposed DBN-pMFA i-vector on the six most confusable languages selected from NIST LRE 2009. The experimental results demonstrate that DBN-pMFA consistently outperforms the previous DBN based framework, and that the simple senone selection scheme significantly reduces the computational complexity.
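The abstract does not define how the SDBF is built, so the following is only a minimal sketch of one way shifted features can add temporal context: each frame is concatenated with copies of later frames taken a fixed number of frames apart, in the spirit of the block stacking used by shifted delta cepstra. The function name `stack_shifted_frames` and the parameters `shift` and `num_blocks` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def stack_shifted_frames(
    frames: Sequence[Sequence[float]],
    shift: int = 3,
    num_blocks: int = 3,
) -> List[List[float]]:
    """Concatenate each frame with later frames spaced `shift` frames apart.

    Assumption (not from the paper): context is added by simple stacking,
    and indices past the end of the utterance are clamped to the last frame.
    """
    if not frames:
        return []
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        row: List[float] = []
        for k in range(num_blocks):
            # Clamp so that edge frames still get a full-length vector.
            idx = min(t + k * shift, last)
            row.extend(frames[idx])
        stacked.append(row)
    return stacked


# Toy 2-dimensional "bottleneck" features for a 5-frame utterance.
frames = [[float(t), float(10 * t)] for t in range(5)]
stacked = stack_shifted_frames(frames, shift=2, num_blocks=2)
```

With `num_blocks` blocks, each output vector has `num_blocks` times the original dimension, which is one reason a subsequent dimension reduction step (such as the one the pMFA performs) becomes useful.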
