Structure-constrained distribution matching using quadratic programming and its application to pronunciation evaluation

In previous work, we proposed a structural representation of speech that is robust to speaker differences because it is invariant to transformations. There, two speech structures were compared by calculating the distance between two structural vectors, each composed of the lengths of a structure's edges. However, this distance cannot yield matching scores directly related to the individual events (nodes) of the two structures. Instead of comparing structural vectors directly, this paper uses structures as constraints for optimal pattern matching. We derive the objective functions and constraint functions for the optimization. Under the assumption of Gaussian distributions with shared covariance matrices, we show that this optimization problem can be reduced to a quadratically constrained quadratic programming problem. To alleviate the problem of overly strong invariance, we use a subspace decomposition method and perform the optimization in each subspace. We evaluate the proposed method on the task of assessing students' English pronunciation. Experimental results show that the proposed method achieves higher correlations with teachers' manual scores than the compared methods.
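The structural-vector comparison used in the earlier work can be illustrated with a small sketch. It is not the authors' implementation. It assumes each event is a Gaussian with one shared covariance matrix and that edge lengths are square roots of Bhattacharyya distances; the abstract does not name the distance measure, so that choice is an assumption. All function names and the synthetic data are hypothetical. The sketch checks numerically that the structural vector is unchanged when every event undergoes the same invertible affine transformation, which is the invariance property the abstract refers to.

```python
import numpy as np

def bhattacharyya_shared_cov(mu_a, mu_b, cov):
    """Bhattacharyya distance between N(mu_a, cov) and N(mu_b, cov).

    With a shared covariance the log-determinant term vanishes and only
    the Mahalanobis-like term (divided by 8) remains.
    """
    diff = mu_a - mu_b
    return 0.125 * diff @ np.linalg.solve(cov, diff)

def structural_vector(means, cov):
    """Edge lengths between all event pairs (assumed: sqrt of the distance)."""
    n = len(means)
    return np.array([
        np.sqrt(bhattacharyya_shared_cov(means[i], means[j], cov))
        for i in range(n) for j in range(i + 1, n)
    ])

def structure_distance(vec_a, vec_b):
    """Root-mean-square difference between two structural vectors."""
    return np.sqrt(np.mean((vec_a - vec_b) ** 2))

# Synthetic events: 4 Gaussians in 3 dimensions with a shared covariance.
rng = np.random.default_rng(0)
dim, n_events = 3, 4
means = rng.normal(size=(n_events, dim))
root = rng.normal(size=(dim, dim))
cov = root @ root.T + dim * np.eye(dim)  # symmetric positive definite

# A hypothetical "speaker difference": the same invertible affine map
# x -> A x + b applied to every event.
A = 2.0 * np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
b = rng.normal(size=dim)
means_t = means @ A.T + b
cov_t = A @ cov @ A.T

vec = structural_vector(means, cov)
vec_t = structural_vector(means_t, cov_t)
invariance_gap = structure_distance(vec, vec_t)
print(invariance_gap)  # expected to be zero up to floating-point error
```

The vector has one entry per edge, so it gives a single score for the whole structure rather than a score per event. That limitation is what motivates using the structure as a constraint instead.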
