Automatically learning speaker-independent acoustic subword units

We investigate methods for unsupervised learning of subword acoustic units of a language directly from speech. We demonstrate that the states of a hidden Markov model "grown" using a novel modification of the maximum likelihood successive state splitting algorithm correspond closely to the phones of the language. In particular, the correspondence between the Viterbi state sequence for unseen speech from the training speaker and the phone transcription of that speech exceeds 85%, and it generalizes to a large extent (∼ 63%) to speech from a different speaker. Furthermore, unsupervised adaptation via MLLR bridges more than half of the gap between the speaker-dependent and cross-speaker correspondence of the automatically learned units to phones, reaching ∼ 75% accuracy.
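The correspondence figures above presuppose some way of scoring how well the Viterbi state sequence of the learned HMM matches a reference phone transcription. The sketch below illustrates one plausible scheme, offered only as an illustration rather than the paper's exact metric: each learned state is labeled with the phone it most often overlaps in a frame-level alignment of the training data, and frame-level agreement is then measured on held-out speech. The function names and toy data are hypothetical.

    from collections import Counter, defaultdict

    def learn_state_to_phone_map(state_seqs, phone_seqs):
        """Label each learned HMM state with the phone it most often
        co-occurs with, frame by frame, in the training alignment."""
        votes = defaultdict(Counter)
        for states, phones in zip(state_seqs, phone_seqs):
            for s, p in zip(states, phones):
                votes[s][p] += 1
        return {s: c.most_common(1)[0][0] for s, c in votes.items()}

    def correspondence(state_seqs, phone_seqs, state_to_phone):
        """Frame-level agreement between phones predicted from the
        Viterbi state sequence and the reference phone transcription."""
        correct = total = 0
        for states, phones in zip(state_seqs, phone_seqs):
            for s, p in zip(states, phones):
                correct += (state_to_phone.get(s) == p)
                total += 1
        return correct / total if total else 0.0

    # Hypothetical frame-aligned toy data: Viterbi state IDs and phones.
    train_states = [[0, 0, 1, 1, 2], [0, 2, 2, 1, 1]]
    train_phones = [["aa", "aa", "t", "t", "s"], ["aa", "s", "s", "t", "t"]]
    mapping = learn_state_to_phone_map(train_states, train_phones)

    test_states = [[0, 1, 1, 2]]
    test_phones = [["aa", "t", "t", "s"]]
    print(correspondence(test_states, test_phones, mapping))  # 1.0 on this toy data

Under this scheme, the speaker-dependent and cross-speaker numbers in the abstract would correspond to evaluating the same learned mapping on held-out speech from the training speaker and from a different speaker, respectively.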
