Speaker-independent speech recognition based on tree-structured speaker clustering

Abstract We previously proposed applying tree-structured speaker clustering to supervised speaker adaptation. This paper extends the approach to unsupervised speaker adaptation and to speaker-independent (SI) speech recognition. The method selects one speaker cluster from among multiple reference speaker clusters arranged in a tree structure. Because it selects a cluster rather than training parameters, adaptation is quick and needs only a small amount of training data. The method was applied to a hidden Markov network (HMnet) and evaluated in Japanese phoneme and phrase recognition experiments. The results show effective unsupervised speaker adaptation using only 5 s of calibration speech. In the SI speech recognition experiments, the method reduced the error rate by 8.5% compared with the conventional SI speech recognition method.
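The cluster-selection idea can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the tree is searched or how clusters are scored, so the greedy top-down descent, the single diagonal-Gaussian model per cluster (standing in for an HMnet), and all names (`ClusterNode`, `log_likelihood`, `select_cluster`) and numbers below are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterNode:
    """Hypothetical reference speaker cluster: one diagonal Gaussian per node."""
    name: str
    mean: List[float]
    var: List[float]
    children: List["ClusterNode"] = field(default_factory=list)


def log_likelihood(node: ClusterNode, frames: List[List[float]]) -> float:
    """Total log-likelihood of the calibration frames under the node's Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, node.mean, node.var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def select_cluster(root: ClusterNode, frames: List[List[float]]) -> ClusterNode:
    """Assumed search strategy: descend while some child scores better than the current node."""
    best, best_ll = root, log_likelihood(root, frames)
    while best.children:
        child = max(best.children, key=lambda c: log_likelihood(c, frames))
        child_ll = log_likelihood(child, frames)
        if child_ll <= best_ll:
            break  # no child fits the calibration data better; stop here
        best, best_ll = child, child_ll
    return best


# Toy tree: the root covers all speakers, deeper nodes cover narrower groups.
root = ClusterNode("all", [0.0, 0.0], [4.0, 4.0], [
    ClusterNode("A", [-1.0, -1.0], [1.0, 1.0]),
    ClusterNode("B", [1.0, 1.0], [1.0, 1.0], [
        ClusterNode("B1", [0.5, 0.5], [0.25, 0.25]),
        ClusterNode("B2", [1.5, 1.5], [0.25, 0.25]),
    ]),
])

# A few made-up feature frames standing in for a short calibration utterance.
frames = [[1.4, 1.6], [1.5, 1.5], [1.6, 1.4]]
chosen = select_cluster(root, frames)
print(chosen.name)
```

No parameters are re-estimated in this sketch: the calibration frames are only scored against fixed cluster models, which is why a selection step of this kind can work with very little data.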
