Model-averaged latent semantic indexing

This poster introduces a novel approach to information retrieval that uses statistical model averaging to improve latent semantic indexing (LSI). Instead of choosing a single dimensionality $k$ for LSI , we propose using several models of differing dimensionality to inform retrieval. To manage this ensemble we weight each model's contribution to an extent inversely proportional to its AIC (Akaike information criterion). Thus each model contributes proportionally to its expected Kullback-Leibler divergence from the distribution that generated the data. We present results on three standard IR test collections, demonstrating significant improvement over both the traditional vector space model and single-model LSI.