The Effect of Term Importance Degree on Text Retrieval

Approaches to index term-weighting have been investigated. In fact, term-weighting is an indispensable process for document ranking in most retrieval systems. Moreover, current information retrieval systems have to deal with explosive growth of documents of various sizes and terms of various frequencies, so an appropriate term-weighting scheme has a crucial impact on the overall performance of these systems. This paper investigates the impact of the term-weighting parameters used in the most well-known retrieval models. The study focuses in particular on the normalization of term frequency in weighting schemes. A novel factor called "term importance degree" has been identified, which can be applied to term-weighting schemes by using several parameters. The calculated correlations between the parameters of the weighting schemes confirmed the impact of this factor on improving the performance of text retrieval systems. Two models of term frequency normalization are inserted into a basic term-weighting scheme to show the importance of terms. The experiments were carried out on standard test collections, and the results were validated by multiple statistical tests.
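
As an illustration of what term frequency normalization means in practice, the following minimal sketch implements two widely used normalizations from the literature: the BM25 (Okapi) term frequency component and pivoted document length normalization. These are not the two models proposed in this paper, and "term importance degree" is not modeled here. The function names and the default values k1 = 1.2, b = 0.75 and s = 0.2 are assumptions chosen for illustration.

```python
import math


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25-style normalized term frequency (illustrative defaults)."""
    if tf <= 0:
        return 0.0
    # Length factor: equals 1 for a document of average length.
    norm = 1.0 - b + b * doc_len / avg_doc_len
    # Saturating function of tf, damped more strongly in long documents.
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def pivoted_tf(tf, doc_len, avg_doc_len, s=0.2):
    """Pivoted-length normalized term frequency (illustrative default slope)."""
    if tf <= 0:
        return 0.0
    # Pivot at the average document length, with slope s.
    norm = 1.0 - s + s * doc_len / avg_doc_len
    # Double-logarithmic damping of the raw term frequency.
    return (1.0 + math.log(1.0 + math.log(tf))) / norm


if __name__ == "__main__":
    # Same raw frequency, documents of different lengths.
    for doc_len in (50, 100, 400):
        print(doc_len, bm25_tf(3, doc_len, 100), pivoted_tf(3, doc_len, 100))
```

In both functions the same raw frequency yields a lower weight in a longer document, which is the effect that the normalization parameters control.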
