ACCURATE KEYWORD SPOTTING USING STRICTLY LEXICAL FILLERS

Our goal is to design an accurate keyword spotter that can deal with any size of keyword set, since the size actually required in a wide range of applications is large (number of airports, number of names in a directory, etc.). This justifies the choice of an architecture based on a large-vocabulary continuous-speech recognizer. In a previous paper (1) we introduced the use of strictly-lexical subword fillers for keyword spotting based on the INRS large-vocabulary continuous-speech recognizer (2), showing that, compared to acoustic fillers, they offer a good compromise between memory and time consumption, freedom of keyword choice and task-independent training on one hand, and accuracy on the other. We propose here two new high-performance designs of individual strictly-lexical subword fillers that, this time, perform better than their acoustic counterparts while still keeping the advantages mentioned above.

Discrimination between keywords and fillers is performed through the lexical graph as well as the language model. Thus the training part of the keyword spotter is task-independent, while the detection part consumes less memory and time for model-score determination than when acoustic fillers were used for discrimination. The use of individual strictly-lexical subword fillers with an adequate language model, instead of a background word model (6), is motivated by the importance of the language-specific lexical constraint brought by subword unigram or bigram frequencies. We present here two high-performance individual strictly-lexical subword filler architectures differing in the orthography of the fillers in the lexicon: the first is phoneme-based while the second is syllable-based.

2. KEYWORD SPOTTER DESCRIPTION

2.1. The INRS Continuous-Speech Recognizer

Our keyword spotter is based on the INRS continuous-speech recognizer (2), an HMM-based, real-time, very-large-vocabulary continuous-speech recognizer.
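The lexical constraint brought by subword unigram and bigram frequencies can be made concrete with a short sketch. Everything below is illustrative: the phoneme inventory, the counts, and the fixed back-off weight are hypothetical, and the deterministic back-off merely mirrors the general form of such subword language models, not the system's implementation.

```python
import math

# Hypothetical phoneme unigram and bigram counts (illustrative only,
# not taken from the paper's training corpus).
unigram_counts = {"s": 60, "a": 70, "t": 40}
bigram_counts = {("s", "a"): 40, ("a", "t"): 30, ("t", "s"): 10, ("a", "s"): 20}
total = sum(unigram_counts.values())

def unigram_logprob(phone):
    """Log unigram probability of a single subword unit."""
    return math.log(unigram_counts[phone] / total)

def bigram_logprob(prev, cur, backoff_weight=0.4):
    """Deterministic back-off: use the bigram estimate when the pair was
    seen in training, otherwise back off to a weighted unigram estimate.
    The back-off weight here is a fixed illustrative constant."""
    if (prev, cur) in bigram_counts:
        return math.log(bigram_counts[(prev, cur)] / unigram_counts[prev])
    return math.log(backoff_weight) + unigram_logprob(cur)

def filler_logprob(phonemes):
    """Language-model score of a phoneme string: a strictly-lexical
    subword filler is constrained only by such subword n-gram scores."""
    score = unigram_logprob(phonemes[0])
    for prev, cur in zip(phonemes, phonemes[1:]):
        score += bigram_logprob(prev, cur)
    return score
```

Under such a model any phoneme string receives a score, but strings whose bigrams were frequent in training score higher; this language-specific preference is exactly the lexical constraint that individual subword fillers exploit, at a fraction of the cost of a full background word model.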
An overview of this recognizer is necessary to the understanding of the final system. The recognizer processes the input speech block by block, the output beam of one block becoming the input beam of the following one. The lexicon presents, for each word orthography, all the different corresponding pronunciations. The system transforms the lexicon into an ordered lexical tree; only phoneme sequences belonging to this graph can be recognized. From this lexical tree, using the computed table of context-dependent phoneme scores (B*), phonetic transcriptions are scored through the two passes; then, using the given language models, the most probable word strings are derived.

The INRS recognizer used here computes language models in the deterministic back-off form from bigram distributions P(wi|wN) and unigram distributions P(wi), where wi is the word under consideration and wN the preceding word in its history. The language model score contribution to the final score is given through the formula: