Improving Retrieval Effectiveness by Using Key Terms in Top Retrieved Documents

In this paper, we propose a method to improve the precision of top retrieved documents in Chinese information retrieval, where the query is a short description, by re-ordering the documents returned by the initial retrieval. To re-order the documents, we first identify the query's terms and their importance by exploiting information derived from the top N (N <= 30) documents of the initial retrieval; we then re-order the top K (N << K) retrieved documents according to which of these query terms they contain. Specifically, we automatically extract key terms from the top N retrieved documents, collect those key terms that also occur in the query together with their document frequencies within the N documents, and finally use the collected terms to re-order the initially retrieved documents. Each collected term is weighted by its length and its document frequency in the top N retrieved documents, and each document is re-ranked by the sum of the weights of the collected terms it contains. In experiments on the 42 query topics of the NTCIR-3 Cross-Lingual Information Retrieval (CLIR) dataset, our method achieves average improvements of 17.8%-27.5% for the top 10 documents and 6.6%-26.9% for the top 100 documents under relax/rigid relevance judgments and different parameter settings.
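The re-ordering procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: documents are represented as lists of terms, and the combination of term length and document frequency into a single weight (a simple linear mix with hypothetical scaling parameters `alpha` and `beta`) is an assumption, since the abstract does not give the exact weighting formula.

```python
from collections import Counter

def rerank(query_terms, top_n_docs, candidate_docs, alpha=1.0, beta=1.0):
    """Re-order candidate_docs using query terms found in the top-N documents.

    Documents are lists of terms. alpha/beta are illustrative scaling
    parameters, not taken from the paper.
    """
    # Collect query terms that occur in the top-N documents, along with
    # their document frequency among those N documents.
    df = Counter()
    for doc in top_n_docs:
        for term in set(query_terms) & set(doc):
            df[term] += 1

    # Weight each collected term by its length and its document frequency
    # (linear combination is an assumed form).
    weights = {t: alpha * len(t) + beta * df[t] for t in df}

    # Re-rank each candidate document by the sum of the weights of the
    # collected terms it contains.
    def score(doc):
        return sum(weights.get(t, 0.0) for t in set(doc))

    return sorted(candidate_docs, key=score, reverse=True)
```

For example, a candidate document containing several long, frequently co-occurring query terms is promoted above one that matches only a short, rare term.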
