A Simplified Latent Semantic Indexing Approach for Multi-Linguistic Information Retrieval

The Latent Semantic Indexing (LSI) approach provides a promising solution to overcoming the language barrier between queries and documents, but the high dimensionality of the multi-linguistic training matrix makes its key step, Singular Value Decomposition (SVD), computationally prohibitive. Based on the semantic parallelism of the multi-linguistic training corpus, we prove in this paper that if the training term-by-document matrix takes either of two symmetry forms, strong or weak, the dimension of the matrix to be decomposed can theoretically be reduced to that of a monolingual matrix, and retrieval accuracy does not deteriorate under this simplification. We also discuss what these two forms of symmetry mean in the context of multi-linguistic information retrieval. Although real-world term-by-document matrices do not naturally appear in either symmetry form, we suggest a way to make them more symmetric in the strong form by means of word clustering and term weighting. An experiment on real data is also presented to support our simplification method.
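To make the dimension-reduction claim concrete, here is a minimal worked sketch under one plausible reading of the strong-symmetry form (the stacked block structure below is our assumed interpretation for illustration, not notation taken from the abstract): suppose the bilingual term-by-document matrix stacks two identical monolingual blocks, and let $A = U\Sigma V^{\mathsf{T}}$ be the SVD of the monolingual block. Then

$$
X \;=\; \begin{pmatrix} A \\ A \end{pmatrix}
\;=\; \underbrace{\frac{1}{\sqrt{2}}\begin{pmatrix} U \\ U \end{pmatrix}}_{\text{orthonormal columns}}
\,\bigl(\sqrt{2}\,\Sigma\bigr)\, V^{\mathsf{T}}.
$$

Since $\frac{1}{\sqrt{2}}\begin{pmatrix} U \\ U \end{pmatrix}$ has orthonormal columns, this is a valid SVD of $X$: its right singular vectors coincide with those of $A$ and its singular values are $\sqrt{2}$ times those of $A$, so only the monolingual block $A$ ever needs to be decomposed.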