ACCURATE KEYWORD SPOTTING USING STRICTLY LEXICAL FILLERS

Our goal is to design an accurate keyword spotter that can deal with any size of keyword set, since the size actually required in a wide range of applications is large (number of airports, number of names in a directory, etc.). This justifies the choice of an architecture based on a large-vocabulary continuous-speech recognizer. In a previous paper (1) we introduced the use of strictly-lexical subword fillers for keyword spotting based on the INRS large-vocabulary continuous-speech recognizer (2), showing that, compared to acoustic fillers, they offer a good compromise between memory and time consumption, freedom of keyword choice and task-independent training on one hand, and accuracy on the other. We propose here two new high-performance designs of individual strictly-lexical subword fillers that, this time, perform better than their acoustic counterparts while still keeping the advantages mentioned above.

Discrimination between keywords and fillers is performed through the lexical graph as well as the language model. Thus the training part of the keyword spotter is task-independent, while the detection part consumes less memory and time for model-score determination than when acoustic fillers were used for discrimination. The use of individual strictly-lexical subword fillers with an adequate language model, instead of a background word model (6), is motivated by the importance of the language-specific lexical constraint brought by subword unigram or bigram frequencies. We present here two high-performance individual strictly-lexical subword filler architectures differing in the orthography of the fillers in the lexicon: the first is phoneme-based while the second is syllable-based.

2. KEYWORD SPOTTER DESCRIPTION

2.1. The INRS Continuous-Speech Recognizer

Our keyword spotter is based on the INRS continuous-speech recognizer (2), an HMM-based, real-time, very-large-vocabulary continuous-speech recognizer.
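The lexical constraint brought by subword unigram and bigram frequencies can be made concrete with a short sketch. Everything below is illustrative: the phoneme inventory, the counts, and the fixed back-off weight are hypothetical, and the deterministic back-off merely mirrors the general form of such subword language models, not the system's implementation.

```python
import math

# Hypothetical phoneme unigram and bigram counts (illustrative only,
# not taken from the paper's training corpus).
unigram_counts = {"s": 60, "a": 70, "t": 40}
bigram_counts = {("s", "a"): 40, ("a", "t"): 30, ("t", "s"): 10, ("a", "s"): 20}
total = sum(unigram_counts.values())

def unigram_logprob(phone):
    """Log unigram probability of a single subword unit."""
    return math.log(unigram_counts[phone] / total)

def bigram_logprob(prev, cur, backoff_weight=0.4):
    """Deterministic back-off: use the bigram estimate when the pair was
    seen in training, otherwise back off to a weighted unigram estimate.
    The back-off weight here is a fixed illustrative constant."""
    if (prev, cur) in bigram_counts:
        return math.log(bigram_counts[(prev, cur)] / unigram_counts[prev])
    return math.log(backoff_weight) + unigram_logprob(cur)

def filler_logprob(phonemes):
    """Language-model score of a phoneme string: a strictly-lexical
    subword filler is constrained only by such subword n-gram scores."""
    score = unigram_logprob(phonemes[0])
    for prev, cur in zip(phonemes, phonemes[1:]):
        score += bigram_logprob(prev, cur)
    return score
```

Under such a model any phoneme string receives a score, but strings whose bigrams were frequent in training score higher; this language-specific preference is exactly the lexical constraint that individual subword fillers exploit, at a fraction of the cost of a full background word model.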
An overview of this recognizer is necessary to the understanding of the final system. The recognizer processes the input speech block by block, the output beam of one block becoming the input beam of the following one. The lexicon presents, for each word orthography, all the different corresponding pronunciations. The system transforms the lexicon into an ordered lexical tree; only phoneme sequences belonging to this graph can be recognized. From this lexical tree, using the computed table of context-dependent phoneme scores (B*), phonetic transcriptions are scored through the two passes; then, using the given language models, the most probable word strings are derived.

The INRS recognizer used here computes language models in the deterministic back-off form from bigram distributions P(wi|wN) and unigram distributions P(wi), where wi is the word under consideration and wN the preceding word in its history. The language model score contribution to the final score is given through the formula: