The University of Amsterdam at NTCIR-5

We describe the University of Amsterdam’s participation in the Cross-Lingual Information Retrieval task at NTCIR-5. We focused on Chinese monolingual retrieval, and aimed to study the effectiveness of language models and of different tokenization methods for Chinese. Our main findings are the following. First, on a bigram index the vector space model excels, whereas the language model performs poorly. Second, on a unigram index the language model is very effective, and even exceeds the performance of the vector space model on the bigram index. Third, and at a more technical level, we found that, in comparison to word-based languages such as English, language models for Chinese require less smoothing, due to the different in