Techniques for automatically transcribing unknown keywords for open keyword set HMM-based word-spotting

Many word-spotting applications require an open keyword vocabulary, allowing the user to search for any term in an audio document database. In conjunction with this, an automatic method of determining the acoustic representation of an arbitrary keyword is needed. For an HMM-based system, where the keyword is represented by a concatenated string of phones, the keyword phone string (KPS), the phonetic transcription must be estimated. This report describes automatic transcription methods for orthographically spelt, spoken, and combined spelt-and-spoken keyword input modes. The spoken keyword case is examined in more detail for the following reasons. Firstly, interaction with an audio-based system is more natural than typing at a keyboard or speaking the orthographic spelling; this is of particular interest for hand-held devices with no, or only a limited, keyboard. Secondly, retrieval requests are likely to contain a high proportion of real names and user-defined jargon, which are difficult to cover fully in spelling-based systems. The basic approach considered is to use a phone-level speech recogniser to hypothesise one or more keyword transcriptions. The effect on the KPSs of the number of pronunciation strings, the HMM complexity and language model used in the phone recogniser, and the number of sample keyword utterances is evaluated through a series of speaker-dependent word-spotting experiments on spontaneous speech messages from the Video Mail Retrieval database. Overall, speech-derived KPSs were found to be less robust than phonetic-dictionary-defined KPSs. However, since the speech-based system does not use a dictionary, it has the advantage that it can handle any word or sound, and it also requires less memory. Given a single keyword utterance, producing multiple keyword pronunciations with an N-best recogniser (N = 7) gave the best word-spotting performance, with a 9.3% drop in performance relative to the phonetic-dictionary-defined system for a null-grammar, monophone HMM-based KPS recogniser. If two utterances are available, greater robustness can be achieved because the problem of poor keyword examples is partially overcome. Again, the N-best approach with N = 7 yielded the best performance (a 6.1% relative drop), but good performance was also achieved using the Viterbi string for each utterance (an 8.5% relative drop), which has a lower computational cost.
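As a rough illustration of the basic approach described above (a minimal sketch, not the implementation used in the report), the Python fragment below shows how candidate keyword pronunciation strings might be pooled from the N-best phone hypotheses of one or more spoken keyword examples. The `phone_recognizer_nbest` interface is a hypothetical stand-in for a null-grammar monophone HMM phone recogniser with N-best output.

```python
# Sketch: deriving keyword phone strings (KPSs) from N-best phone recognition
# of spoken keyword examples. The recogniser interface is assumed, not real.

from typing import Callable, List, Sequence

PhoneString = List[str]  # e.g. ["k", "iy", "w", "er", "d"]


def derive_kps(
    utterances: Sequence[object],  # audio for each spoken keyword example
    phone_recognizer_nbest: Callable[[object, int], List[PhoneString]],
    n_best: int = 7,  # N = 7 gave the best results in the report's experiments
) -> List[PhoneString]:
    """Pool the N-best phone hypotheses from every available keyword utterance
    and return the unique pronunciation strings. Each string can then be
    expanded into a concatenated-phone HMM, with the set used as parallel
    pronunciation alternatives in the word-spotting network."""
    candidates: List[PhoneString] = []
    for utt in utterances:
        for hyp in phone_recognizer_nbest(utt, n_best):
            if hyp and hyp not in candidates:  # skip empty or duplicate strings
                candidates.append(hyp)
    return candidates


def derive_kps_viterbi(utterances, phone_recognizer_nbest):
    """Cheaper alternative: keep only the single Viterbi (1-best) string per
    utterance, which the report found to work nearly as well when two
    utterances are available."""
    return derive_kps(utterances, phone_recognizer_nbest, n_best=1)
```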
