Robust spoken term detection using combination of phone-based and word-based recognition

We propose a spoken term detection method that is robust to word recognition errors by combining phone-based and word-based recognition. Conventional methods based on similar frameworks are problematic because phone-based recognition produces a large number of insertion errors. In our method, different substitution penalties are assigned to different phone pairs to reduce such errors. We evaluated the method on the Corpus of Spontaneous Japanese. With recall fixed at 50%, precision was 4.4 points higher than with detection using only word-based recognition. We also report on the effectiveness of optimizing the combination weight for each keyword.
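
As a rough illustration of the two ideas named in the abstract (phone-pair-dependent substitution penalties and a per-keyword combination weight), the following Python sketch matches a keyword's phone sequence against a recognized phone sequence with a weighted edit distance and then mixes the result with a word-based hit. It is not the authors' implementation. The penalty values, the default costs, the score normalization, the linear combination, and all names (SUB_PENALTY, best_match_cost, combined_score) are assumptions made for this example only.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical penalties: acoustically similar phone pairs are cheaper to
# substitute than unrelated ones. The values are illustrative, not from the paper.
SUB_PENALTY: Dict[Tuple[str, str], float] = {
    ("b", "p"): 0.3,
    ("d", "t"): 0.3,
    ("m", "n"): 0.4,
}
DEFAULT_SUB = 1.0  # assumed cost for any unlisted phone pair
INS_DEL = 1.0      # assumed cost for an inserted or deleted phone


def sub_cost(a: str, b: str) -> float:
    """Substitution penalty for a phone pair (treated as symmetric here)."""
    if a == b:
        return 0.0
    return SUB_PENALTY.get((a, b), SUB_PENALTY.get((b, a), DEFAULT_SUB))


def best_match_cost(keyword: Sequence[str], phones: Sequence[str]) -> float:
    """Minimum weighted edit distance between the keyword and any
    contiguous stretch of the recognized phone sequence."""
    # Row 0 is all zeros, so a match may start at any position for free.
    prev = [0.0] * (len(phones) + 1)
    for i in range(1, len(keyword) + 1):
        cur = [prev[0] + INS_DEL]
        for j in range(1, len(phones) + 1):
            cur.append(min(
                prev[j - 1] + sub_cost(keyword[i - 1], phones[j - 1]),
                prev[j] + INS_DEL,     # keyword phone missing in the recognition
                cur[j - 1] + INS_DEL,  # extra phone in the recognition
            ))
        prev = cur
    # Taking the minimum over the last row lets the match end anywhere.
    return min(prev)


def combined_score(keyword: Sequence[str], phones: Sequence[str],
                   word_hit: bool, weight: float) -> float:
    """Linear mix of a word-based hit and a phone-based match score.
    `weight` stands in for a per-keyword combination weight.
    The keyword is assumed to be non-empty."""
    phone_score = 1.0 - best_match_cost(keyword, phones) / len(keyword)
    word_score = 1.0 if word_hit else 0.0
    return weight * word_score + (1.0 - weight) * phone_score
```

With these assumed penalties, a keyword whose /b/ is recognized as /p/ is penalized only lightly, while a match that requires substituting unrelated phones costs the full default penalty, which is one way such a scheme could suppress spurious detections. Setting the weight separately for each keyword corresponds to the per-keyword optimization mentioned above.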