Using the Web corpus to translate the queries in cross-lingual information retrieval

Accurate cross-language information retrieval (CLIR) requires that query terms be translated correctly. In this paper, we propose a new method for Web-corpus-based query translation that consists of two steps: (1) translation candidate extraction and (2) translation selection. In translation candidate extraction, we submit the source-language query to a search engine to retrieve target-language corpus data from the Web. Candidate translations are expected to appear in both the title and the query-biased summary of each retrieved document, so we take the common (intersecting) substrings of different title pairs (or title-summary pairs) to pin down the possible translations. In translation selection, we choose the translation(s) from the candidates with a ranking function that combines substring frequency, inverse translation frequency, and a top-result-preferred factor. Experimental results indicate that the top-3 inclusion rate of translations is 75.57% and that our method is also very effective in the CLIR task.
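
The two steps can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `min_len` threshold, the use of only the single longest common substring per pair, and the 1/(1 + rank) weighting for the top-result-preferred factor are all assumptions made for illustration. The inverse translation frequency term is omitted because the abstract does not define it, and the titles are invented toy data rather than search engine output.

```python
from collections import Counter
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    Standard dynamic programming over suffix-match lengths; on ties the
    substring that ends earliest in `a` is returned.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def rank_candidates(texts, min_len=2):
    """Extract and score candidate translations from ranked result texts.

    `texts` holds titles (or summaries) in search-result order. Each pair
    contributes its longest common substring as a candidate. The score adds
    1 / (1 + rank of the higher-ranked text) per pair, so it grows with
    substring frequency and favours top results. This weighting is a
    hypothetical stand-in for the paper's ranking function.
    """
    scores = Counter()
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        candidate = longest_common_substring(a, b).strip()
        if len(candidate) >= min_len:
            scores[candidate] += 1.0 / (1 + min(i, j))
    return scores.most_common()


# Toy target-language titles, assumed to be returned for one source query.
titles = [
    "ordinateur portable pas cher",
    "acheter un ordinateur",
    "ordinateur de bureau",
]
ranked = rank_candidates(titles)
print(ranked[0][0])
```

In this toy run every title pair shares the same substring, so it is the only candidate. With real search results, noisy candidates would also appear, and the remaining factors of the ranking function would be needed to separate them.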