Paper Information - Approximate Dimension Equalization in Vector-based Information Retrieval

Vector-based information retrieval methods such as the vector space model (VSM), latent semantic indexing (LSI), and the generalized vector space model (GVSM) represent both queries and documents by high-dimensional vectors learned from analyzing a training corpus of text. VSM scales well to large collections, but cannot represent term–term correlations, which prevents it from being used in translingual retrieval. GVSM and LSI can represent term–term correlations, but do not scale well to very large retrieval collections. We present a novel method we call approximate dimension equalization (ADE) that combines ideas from VSM, LSI, and GVSM to produce a method that performs well on large collections, scales well computationally, and can represent term–term correlations. We compare the performance of ADE to the other methods on both large and small collections of both monolingual and bilingual text. ADE outperforms all other methods on large bilingual collections, and performs close to the best in all other cases.
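The contrast the abstract draws between VSM and GVSM can be illustrated with a toy example. The sketch below is not the ADE method and is not taken from the paper; it assumes the textbook formulations, with VSM comparing term vectors directly and GVSM first mapping each term vector into training-document space through a term-by-document matrix. The vocabulary, the `TRAIN` matrix, and the helper names `cosine`, `project`, `vsm_sim`, and `gvsm_sim` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def project(train, vec):
    """Map a term-space vector into training-document space (A^T vec)."""
    n_docs = len(train[0])
    return [sum(train[t][j] * vec[t] for t in range(len(vec))) for j in range(n_docs)]

def vsm_sim(query, doc):
    """VSM: compare the raw term vectors, so distinct terms never match."""
    return cosine(query, doc)

def gvsm_sim(train, query, doc):
    """GVSM-style: compare vectors after projecting through the training corpus."""
    return cosine(project(train, query), project(train, doc))

# Hypothetical toy data: rows are terms, columns are four training documents.
VOCAB = ["car", "automobile", "engine", "banana"]
TRAIN = [
    [1, 0, 1, 0],  # car
    [0, 1, 1, 0],  # automobile (co-occurs with "car" in training document 3)
    [1, 1, 0, 0],  # engine
    [0, 0, 0, 1],  # banana
]

query = [1, 0, 0, 0]       # the single term "car"
doc_auto = [0, 1, 0, 0]    # a document containing only "automobile"
doc_fruit = [0, 0, 0, 1]   # a document containing only "banana"

print("VSM  car vs automobile:", vsm_sim(query, doc_auto))
print("GVSM car vs automobile:", gvsm_sim(TRAIN, query, doc_auto))
print("GVSM car vs banana:    ", gvsm_sim(TRAIN, query, doc_fruit))
```

Under these assumptions, VSM gives the "car" query no overlap with the "automobile" document, while the GVSM-style projection gives them a nonzero similarity because the two terms co-occur in the training corpus. The same mechanism is what allows terms from different languages to be related through aligned training text, at the cost of working with vectors whose length grows with the training collection.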

Fan Jiang | Michael L. Littman

[1] T. Landauer, et al. Indexing by Latent Semantic Analysis, 1990.

[2] Salim Roukos, et al. Ad hoc and Multilingual Information Retrieval at IBM, 1998, TREC.

[3] Susan T. Dumais, et al. Automatic 3-Language Cross-Language Information Retrieval with Latent Semantic Indexing, 1997, TREC.

[4] W. Bruce Croft, et al. Query expansion using local and global document analysis, 1996, SIGIR '96.

[5] Gene H. Golub, et al. Matrix computations, 1983.

[6] P. C. Wong, et al. Generalized vector spaces model in information retrieval, 1985, SIGIR '85.

[7] Stephen E. Robertson, et al. Okapi at TREC, 1992, TREC.

[8] Jean Paul Ballerini, et al. Experiments in multilingual information retrieval using the SPIDER system, 1996, SIGIR '96.

[9] Stephen E. Robertson, et al. Okapi at TREC-3, 1994, TREC.

[10] Fan Jiang, et al. Learning a Language-Independent Representation for Terms from a Partially Aligned Corpus, 1998, ICML.

[11] Yiming Yang, et al. Translingual Information Retrieval: Learning from Bilingual Corpora, 1998, Artif. Intell.

[12] Z. Zhang, et al. On matrices with low-rank-plus-shift structure: Partial SVD and latent semantic indexing, 1998.

[13] Hongyuan Zha, et al. Matrices with Low-Rank-Plus-Shift Structure: Partial SVD and Latent Semantic Indexing, 1999, SIAM J. Matrix Anal. Appl.