Approximate Dimension Equalization in Vector-based Information Retrieval

Vector-based information retrieval methods such as the vector space model (VSM), latent semantic indexing (LSI), and the generalized vector space model (GVSM) represent both queries and documents by high-dimensional vectors learned from analyzing a training corpus of text. VSM scales well to large collections, but cannot represent term–term correlations, which prevents it from being used in translingual retrieval. GVSM and LSI can represent term–term correlations, but do not scale well to very large retrieval collections. We present a novel method, approximate dimension equalization (ADE), that combines ideas from VSM, LSI, and GVSM; it performs well on large collections, scales well computationally, and can represent term–term correlations. We compare the performance of ADE to the other methods on both large and small collections of both monolingual and bilingual text. ADE outperforms all other methods on large bilingual collections, and performs close to the best in all other cases.
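
The contrast between VSM and GVSM drawn above can be illustrated with a minimal sketch. It does not implement ADE, whose details are not given in this abstract. It assumes the common GVSM formulation in which queries and documents are mapped through the transpose of a training term-document matrix before cosine similarity is computed. The four-term vocabulary, the matrix `A`, and the function names are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def project(A, x):
    """Compute A^T x: map a term-space vector into training-document space."""
    n_terms, n_docs = len(A), len(A[0])
    return [sum(A[t][j] * x[t] for t in range(n_terms)) for j in range(n_docs)]

def vsm_sim(q, d):
    """VSM: compare query and document directly in term space."""
    return cosine(q, d)

def gvsm_sim(A, q, d):
    """GVSM (assumed formulation): compare after projecting through A^T."""
    return cosine(project(A, q), project(A, d))

# Toy training matrix (hypothetical): rows are terms, columns are training documents.
# Term order: car, automobile, engine, flower.
A = [
    [1, 0, 1],  # car
    [1, 0, 1],  # automobile (co-occurs with "car" in training docs 0 and 2)
    [1, 0, 0],  # engine
    [0, 1, 0],  # flower
]

query = [1, 0, 0, 0]       # contains only "car"
doc_auto = [0, 1, 0, 0]    # contains only "automobile"
doc_flower = [0, 0, 0, 1]  # contains only "flower"

print("VSM  car vs automobile:", vsm_sim(query, doc_auto))
print("GVSM car vs automobile:", round(gvsm_sim(A, query, doc_auto), 6))
print("GVSM car vs flower:    ", gvsm_sim(A, query, doc_flower))
```

In this toy setting VSM assigns zero similarity to a query and a document that share no terms, whereas the GVSM-style projection scores them as similar because the two terms co-occur in the training documents. The same mechanism is what allows a training corpus of paired bilingual text to link terms across languages. The cost is that every vector now has one dimension per training document, which suggests why such representations become expensive on very large collections.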