Berkeley Chinese Information Retrieval at TREC-5: Technical Report

For the Chinese track in TREC-5, the collection consists of 164,761 documents drawn from People's Daily and Xinhua News articles, about 170 MB in total. There are 28 queries. The task of the Chinese track is to submit, for each query, 1000 documents ranked in order of likelihood of relevance to the query. It is a well-known problem that Chinese text has no separator between words, so Chinese words cannot be used directly to index or search text as words can in English. Researchers have therefore used characters, n-grams, or words as searchable tokens. In TREC-5, we tried to use any meaningful string within the text as an indexing or search token. Our basic strategy is to use an exhaustive dictionary to segment the document collection and the queries, and to use the Berkeley TREC-2 ad hoc retrieval algorithm to rank the retrieved documents.
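To make the token choices above concrete, the following minimal Python sketch contrasts character, bigram, and dictionary-based tokens for one short string. It is an illustration under stated assumptions, not the system described in this report: the report does not specify its segmentation procedure, so greedy forward longest matching is used here only as a common stand-in, and the names char_ngrams, segment, toy_dictionary, and max_len are hypothetical.

```python
# Illustrative sketch only. Greedy forward longest matching is an assumed
# segmentation method, and the four-entry dictionary is hypothetical.


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def segment(text, dictionary, max_len=4):
    """Segment text by greedy longest match against a dictionary.

    At each position, take the longest dictionary entry starting there
    (up to max_len characters); fall back to a single character when no
    entry matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


toy_dictionary = {"中国", "人民", "日报", "人民日报"}
sentence = "中国人民日报"

unigrams = char_ngrams(sentence, 1)        # one token per character
bigrams = char_ngrams(sentence, 2)         # overlapping character pairs
words = segment(sentence, toy_dictionary)  # dictionary-based tokens

print(unigrams)
print(bigrams)
print(words)
```

With this toy dictionary the segmenter prefers the longer entry "人民日报" over "人民" followed by "日报", which shows why dictionary coverage directly shapes the resulting index terms.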