Paper information - Using latent semantic indexing for morph-based spoken document retrieval

Using latent semantic indexing for morph-based spoken document retrieval

Previously, phone-based and word-based approaches have been used for spoken document retrieval. The former suffer from high error rates and the latter from the limited vocabulary of the recognizer. Our method relies on an unlimited-vocabulary continuous speech recognizer that uses morpheme-like units discovered in an unsupervised manner. The morpheme-like units, or “morphs” for short, have also been used successfully as index terms. One problem with using morphs as index terms is that the segmentation does not always separate the same stem for different inflected forms of the same word. This resembles the problem of synonyms. In this paper, we apply latent semantic indexing to morph-based retrieval. The idea is to project morphs that correspond to the same word, as well as other semantically related terms, to the same dimension. The results show clear improvements in Finnish spoken document retrieval performance.

Index Terms: spoken document retrieval, latent semantic indexing, morpheme segmentation.
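The projection idea described in the abstract can be illustrated with a minimal sketch of latent semantic indexing via a truncated SVD. This is not the authors' system: the morph inventory, the four toy documents, the raw count weighting, and the choice of k = 2 are all assumptions made for illustration, and the helper names (`lsi_fit`, `fold_in`, `cosine`) are hypothetical.

```python
import numpy as np

# Toy morph-by-document count matrix (illustrative data, not from the paper).
# "talo" and "talossa" stand for two forms of the same word that the
# segmentation failed to reduce to a shared stem; likewise "auto"/"autolla".
morphs = ["talo", "talossa", "koti", "auto", "autolla", "tie"]
A = np.array([
    [1, 0, 0, 0],   # talo
    [0, 1, 0, 0],   # talossa
    [1, 1, 0, 0],   # koti
    [0, 0, 1, 0],   # auto
    [0, 0, 0, 1],   # autolla
    [0, 0, 1, 1],   # tie
], dtype=float)

def lsi_fit(term_doc, k):
    """Truncated SVD: keep the k largest singular triplets."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Rows of Vt[:k].T are the documents expressed in the k-dim latent space.
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(query, U_k, s_k):
    """Map a term-space query into the latent space: q^T U_k S_k^{-1}."""
    return (query @ U_k) / s_k

def cosine(q, docs, eps=1e-12):
    """Cosine similarity between one vector and each row of docs."""
    return (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + eps)

# Query consisting of the single morph "talo".
query = np.zeros(len(morphs))
query[morphs.index("talo")] = 1.0

raw_scores = A.T @ query                      # exact morph matching
U_k, s_k, doc_vecs = lsi_fit(A, k=2)          # k=2 is an arbitrary toy choice
lsi_scores = cosine(fold_in(query, U_k, s_k), doc_vecs)

print("raw:", raw_scores)
print("lsi:", np.round(lsi_scores, 3))
```

In this toy setup, the second document contains only "talossa" and "koti", so exact morph matching gives it a score of zero for the query "talo". In the latent space it is scored alongside the first document, because both forms co-occur with "koti" and are mapped to the same dimension.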

Mikko Kurimo | Ville T. Turunen

[1] Eero Sormunen, et al. A Method for Measuring Wide Range Performance of Boolean Queries in Full-Text Databases, 2000.

[2] Mathias Creutz, et al. Unsupervised Discovery of Morphemes, 2002, SIGMORPHON.

[3] Mikko Kurimo, et al. Unlimited vocabulary speech recognition with morph language models applied to Finnish, 2006, Comput. Speech Lang.

[4] Steve Renals, et al. Indexing and retrieval of broadcast news, 2000, Speech Commun.

[5] Mikko Kurimo, et al. To recover from speech recognition errors in spoken document retrieval, 2005, INTERSPEECH.

[6] Mikko Kurimo, et al. An evaluation of a spoken document retrieval baseline system in Finnish, 2004, INTERSPEECH.

[7] Janne Pylkkönen. New pruning criteria for efficient decoding, 2005, INTERSPEECH.

[8] Peter Schäuble, et al. New techniques for open-vocabulary spoken document retrieval, 1998, SIGIR '98.

[9] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.

[10] T. Landauer, et al. Indexing by Latent Semantic Analysis, 1990.

[11] Ellen M. Voorhees, et al. The TREC Spoken Document Retrieval Track: A Success Story, 2000, TREC.

[12] Kimmo Koskenniemi, et al. A General Computational Model for Word-Form Recognition and Production, 1984, ACL.