University of Chicago at NTCIR4 CLIR: Multi-Scale Query Expansion

Pseudo-relevance feedback, while useful in monolingual applications for refining and enriching short user queries, proves even more important in cross-language information retrieval (CLIR). For CLIR, query expansion before and after translation provides an opportunity to recover from translation gaps, reduce ambiguity, and enhance recall. Furthermore, for CLIR in unsegmented Asian languages, appropriate unit selection for translation, indexing, and retrieval plays a key role. In our NTCIR4 CLIR experiments, we compare the effectiveness of different unit selection strategies (words and subword units) and of different stages (pre- and post-translation) for query expansion. We find that, for the very short queries with many untranslatable words in this test collection, both pre- and post-translation query expansion, independently and in conjunction, significantly enhance retrieval effectiveness for all unit selection strategies. In merged multilingual runs, we find no significant differences across unit selection strategies for expansion. A more detailed per-language analysis, however, finds significantly better effectiveness in Japanese when character-bigram units are used to identify the presumed relevant documents during query expansion, with word and bigram units chosen as expansion terms, than when word-based units are used to identify the relevant documents.
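As an illustration of the two ideas combined above, the following minimal sketch shows overlapping character-bigram unit selection for unsegmented text and a simple pseudo-relevance feedback step that adds frequent units from the top-ranked documents to the query. It is not the system used in these experiments: the function names (char_bigrams, expand_query), the parameters top_docs and top_terms, and the raw-frequency term selection are all assumptions made for the example, and the initial ranking is taken as given.

```python
from collections import Counter


def char_bigrams(text):
    """Return overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def expand_query(query_units, ranked_docs, tokenize, top_docs=2, top_terms=3):
    """Append the most frequent new units from the top-ranked documents.

    Hypothetical helper: the top_docs documents are presumed relevant,
    and units are scored by raw frequency (an assumption for this sketch).
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        # Count only units that are not already part of the query.
        counts.update(u for u in tokenize(doc) if u not in query_units)
    new_units = [u for u, _ in counts.most_common(top_terms)]
    return list(query_units) + new_units


if __name__ == "__main__":
    query = char_bigrams("情報検索")
    docs = ["情報検索の評価", "検索の評価手法"]  # assumed to be already ranked
    print(expand_query(query, docs, char_bigrams))
```

Passing a word segmenter instead of char_bigrams as the tokenize argument would give the word-based variant, so the same feedback step can be run with either unit type.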