Divergence estimation based on deep neural networks and its use for language identification

In this paper, we propose a method that estimates the statistical divergence between probability distributions with a DNN-based discriminative approach, and apply it to language identification tasks. Statistical divergence is generally defined as a functional of two probability density functions, and these densities are usually represented in a parametric form; if the assumed distribution does not match the true one, the resulting divergence estimate is erroneous. In our proposed method, Bayes' theorem is used to estimate the divergence with a DNN serving as a discriminative estimation model, so the divergence between two distributions can be estimated without assuming a specific parametric form for either of them. When the amount of data available for estimation is small, however, computing the integral of the divergence function over the entire feature space and training the neural networks become intractable. To mitigate this problem, we introduce two solutions: a model adaptation method for the DNN and a sampling approach for the integration. We apply this approach to language identification, where the estimated divergences are used to extract a speech structure. Experimental results show that our approach improves language identification performance by 10.85% relative over a conventional i-vector baseline.
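To make the core idea concrete, the following is a minimal sketch (not the paper's actual system) of discriminative divergence estimation via Bayes' theorem: with equal class priors, the density ratio satisfies p1(x)/p2(x) = P(c1|x)/P(c2|x), so a classifier's log-odds can replace the unknown densities, and the integral defining KL divergence is approximated by Monte Carlo sampling. A simple logistic regression on hand-picked features stands in for the DNN here; all variable names and the toy Gaussian setup are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x1 = rng.normal(0.0, 1.0, n)  # samples from p1 = N(0, 1)
x2 = rng.normal(1.0, 1.0, n)  # samples from p2 = N(1, 1)

def feats(x):
    # Features [1, x, x^2] can represent any Gaussian log-density ratio.
    return np.stack([np.ones_like(x), x, x**2], axis=1)

X = np.concatenate([feats(x1), feats(x2)])
y = np.concatenate([np.ones(n), np.zeros(n)])  # label 1 = drawn from p1

# Train a logistic-regression "discriminator" by plain gradient descent
# (a stand-in for the DNN discriminative model in the paper).
w = np.zeros(3)
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)

# Bayes' theorem with equal priors:
#   log p1(x)/p2(x) = log P(c1|x)/P(c2|x) = classifier log-odds.
# Averaging the log-odds over samples from p1 is a Monte Carlo
# estimate of KL(p1 || p2); the analytic value here is 0.5.
kl_est = (feats(x1) @ w).mean()
print(f"estimated KL = {kl_est:.3f}")
```

The same construction extends to other f-divergences by averaging a different function of the estimated density ratio, which is why a single trained discriminator suffices.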
