Using latent semantic indexing for morph-based spoken document retrieval

Spoken document retrieval has previously relied on phone-based and word-based approaches. The former suffers from high error rates and the latter from the limited vocabulary of the recognizer. Our method relies on an unlimited-vocabulary continuous speech recognizer that uses morpheme-like units discovered in an unsupervised manner. These morpheme-like units, or “morphs” for short, have also been used successfully as index terms. One problem with using morphs as index terms is that the segmentation does not always produce the same stem for different inflected forms of the same word. This resembles the problem of synonyms. In this paper, we apply latent semantic indexing to morph-based retrieval. The idea is to project morphs that correspond to the same word, as well as other semantically related terms, onto the same dimension. The results show clear improvements in Finnish spoken document retrieval performance.

Index Terms: spoken document retrieval, latent semantic indexing, morpheme segmentation.
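
The projection idea can be illustrated with a minimal sketch, which is not the system evaluated in this paper. It assumes a tiny hand-made morph-by-document count matrix, invented morph strings (`kaupp` and `kaup` stand in for two stem variants of one word), raw counts with no term weighting, and a latent dimension of `k=2`; all of these are illustrative assumptions. The two stem variants never occur in the same document, so their raw cosine similarity is zero, but a truncated SVD places them close together because they share context morphs.

```python
import numpy as np

# Hypothetical morph index terms, invented for illustration
# (not the output of any real morph segmenter).
terms = ["kaupp", "kaup", "myy", "hinta", "tuuli", "sade"]

# Rows are morphs, columns are six toy documents, entries are raw counts.
# "kaupp" and "kaup" never share a document but share the context
# morphs "myy" and "hinta"; "tuuli" and "sade" form an unrelated topic.
A = np.array([
    [1, 0, 1, 0, 0, 0],  # kaupp
    [0, 1, 0, 1, 0, 0],  # kaup
    [1, 1, 0, 1, 0, 0],  # myy
    [1, 1, 1, 0, 0, 0],  # hinta
    [0, 0, 0, 0, 1, 1],  # tuuli
    [0, 0, 0, 0, 1, 1],  # sade
], dtype=float)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def latent_term_vectors(matrix, k):
    """Return the k-dimensional term vectors U_k * S_k of a truncated SVD."""
    U, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * s[:k]


T = latent_term_vectors(A, k=2)  # k=2 is an assumption for this toy data

i, j, w = terms.index("kaupp"), terms.index("kaup"), terms.index("tuuli")
raw_sim = cosine(A[i], A[j])           # similarity in the original term space
latent_sim = cosine(T[i], T[j])        # similarity after projection
latent_unrelated = cosine(T[i], T[w])  # stem variant vs. unrelated morph

print(f"raw: {raw_sim:.3f}  latent: {latent_sim:.3f}  "
      f"unrelated: {latent_unrelated:.3f}")
```

In this toy setting the two stem variants collapse onto the same latent direction while the unrelated morph stays orthogonal to them. With real recognizer output, the choice of term weighting and of `k` would need to be tuned.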