Speaker adaptation based on the multilinear decomposition of training speaker models

This paper presents a novel speaker adaptation method based on the multilinear analysis of training speakers using Tucker decomposition. A Tucker decomposition of the training models decouples the data into state, mean-vector dimension, and speaker subspaces. Using the bases of the state subspace, we derive a speaker adaptation formula in which the matrix of basis vectors is weighted in both its row and column spaces; the proposed method subsumes the eigenvoice technique as a special case. Results on an isolated-word recognition task show that the Tucker decomposition-based method outperforms both eigenvoice and MLLR when the adaptation data are 15 seconds or longer. Furthermore, the method extends naturally to multi-factor problems, enabling joint adaptation to factors such as speaker and noise environment.
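As a rough illustration of the decomposition step described above (not the paper's actual adaptation formula), the sketch below stacks speaker-dependent HMM mean vectors into a three-way tensor (states × mean-vector dimensions × speakers) and computes a truncated Tucker decomposition via the higher-order SVD. The tensor sizes, ranks, and variable names are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the remaining axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor, ranks):
    """Truncated higher-order SVD, a standard way to compute a Tucker decomposition."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of the mode-n unfolding span that mode's subspace.
        U, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(U[:, :rank])
    # Core tensor: project the data tensor onto each factor matrix (n-mode products).
    core = tensor
    for mode, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# Hypothetical sizes: J HMM states, D mean-vector dimensions, S training speakers.
J, D, S = 500, 39, 50
means = np.random.randn(J, D, S)  # stand-in for stacked speaker-dependent mean vectors

core, (U_state, U_dim, U_speaker) = hosvd(means, ranks=(100, 39, 20))
print(core.shape, U_state.shape, U_dim.shape, U_speaker.shape)
```

Here `U_state`, `U_dim`, and `U_speaker` are the basis matrices of the state, dimension, and speaker subspaces; retaining all columns of the speaker factor and unfolding along that mode recovers an eigenvoice-style basis, which is one way to see how the eigenvoice technique arises as a special case.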