Unsupervised training of an HMM-based speech recognizer for topic classification

HMM-based speech-to-text (STT) systems are widely deployed not only for dictation but also as the first processing stage of many automatic speech applications, such as spoken topic classification. However, training the HMMs requires transcribed data, which precludes their use where transcribed speech is difficult to come by because of the specific domain, channel, or language. In this work, we propose building HMM-based speech recognizers without transcribed data by formulating HMM training as an optimization over both the parameter space and the transcription sequence space. We describe how this can be easily implemented using existing STT tools. We tested the effectiveness of our unsupervised training approach on topic classification on the Switchboard corpus. The unsupervised HMM recognizer, initialized with a segmental tokenizer, outperformed both an HMM phoneme recognizer trained with 1 hour of transcribed data and the Brno University of Technology (BUT) Hungarian phoneme recognizer. This approach can also be applied to other speech applications, including spoken term detection, language and speaker verification.
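
To make the idea of optimizing over both parameters and transcriptions concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes one common way of carrying out such a joint optimization: alternating between decoding the best label sequence under the current model and re-estimating the model from that sequence (Viterbi-style hard decisions). The 1-D features, unit-variance Gaussian "units", the `switch_penalty`, and all function names are hypothetical choices made for illustration; the initial means stand in for whatever an initial tokenizer would provide.

```python
def viterbi_decode(frames, means, switch_penalty=1.0):
    """Best unit label per frame, using squared distance to each unit mean
    plus a fixed penalty whenever the unit changes (toy stand-in for an HMM)."""
    n_units = len(means)
    # cost[k]: best cumulative cost of any label path ending in unit k
    cost = [(frames[0] - m) ** 2 for m in means]
    back = []  # back[t][k]: predecessor unit of k at frame t + 1
    for x in frames[1:]:
        best_prev = min(range(n_units), key=lambda k: cost[k])
        new_cost, ptr = [], []
        for k in range(n_units):
            stay = cost[k]
            switch = cost[best_prev] + switch_penalty
            if stay <= switch:
                ptr.append(k)
                prev_cost = stay
            else:
                ptr.append(best_prev)
                prev_cost = switch
            new_cost.append(prev_cost + (x - means[k]) ** 2)
        cost = new_cost
        back.append(ptr)
    # Trace back from the cheapest final unit.
    k = min(range(n_units), key=lambda j: cost[j])
    labels = [k]
    for ptr in reversed(back):
        k = ptr[k]
        labels.append(k)
    labels.reverse()
    return labels


def reestimate(frames, labels, means):
    """Update each unit mean from the frames currently assigned to it."""
    new_means = list(means)
    for k in range(len(means)):
        assigned = [x for x, lab in zip(frames, labels) if lab == k]
        if assigned:  # keep the old mean if a unit received no frames
            new_means[k] = sum(assigned) / len(assigned)
    return new_means


def train_unsupervised(frames, init_means, n_iters=10):
    """Alternate between the transcription step and the parameter step."""
    means = list(init_means)
    labels = viterbi_decode(frames, means)
    for _ in range(n_iters):
        means = reestimate(frames, labels, means)
        new_labels = viterbi_decode(frames, means)
        if new_labels == labels:  # transcription stopped changing
            break
        labels = new_labels
    return means, labels


# Toy data: two acoustic "units" near 0 and 5; no transcription is supplied.
frames = [0.1, -0.1, 0.0, 5.1, 4.9, 5.0, 0.2, -0.2]
means, labels = train_unsupervised(frames, init_means=[1.0, 4.0])
print(labels, means)
```

In this sketch the label sequence plays the role of the unknown transcription and the means play the role of the HMM parameters; a real system would replace both steps with the decoding and training passes of an existing STT toolkit.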
