Latent Semantic Indexing by Self-Organizing Map

An important problem for the information retrieval from spoken documents is how to extract those relevant documents which are poorly decoded by the speech recognizer. In this paper we propose a stochastic index for the documents based on the Latent Semantic Analysis (LSA) of the decoded document contents. The original LSA approach uses Singular Value Decomposition to reduce the dimensionality of the documents. As an alternative, we propose a computationally more feasible solution using Random Mapping (RM) and Self-Organizing Maps (SOM). The motivation for clustering the documents by SOM is to reduce the effect of recognition errors and to extract new characteristic index terms. Experimental indexing results are presented using relevance judgments for the retrieval results of test queries and using a document perplexity defined in this paper to measure the power of the index models.

[1]  Steve Renals,et al.  Retrieval of broadcast news documents with the THISL system , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[2]  Samuel Kaski,et al.  Dimensionality reduction by random mapping: fast similarity computation for clustering , 1998, 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227).

[3]  Johan M. Andersen Baseline System for Hybrid Speech Recognition on French (Experiments on BREF) , 1998 .

[4]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[5]  Stephen E. Robertson,et al.  Relevance weighting of search terms , 1976, J. Am. Soc. Inf. Sci..

[6]  Michael W. Berry,et al.  Large-Scale Sparse Singular Value Computations , 1992 .

[7]  Stanley F. Chen,et al.  Evaluation Metrics For Language Models , 1998 .

[8]  Jerome R. Bellegarda,et al.  A statistical language modeling approach integrating local and global constraints , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[9]  Steve Renals,et al.  The THISL Spoken Document Retrieval System , 1998, TREC.

[10]  Mark J. F. Gales,et al.  Improving environmental robustness in large vocabulary speech recognition , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[11]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .