Effective level of term frequency impact on large-scale retrieval performance: by top-term ranking method

As the volume of information grows, effective retrieval methods become increasingly essential. This paper develops a new method for assessing the role of the term frequency-inverse document frequency (TF-IDF) measures commonly used in text retrieval systems based on the vector space model. We carried out preliminary tests to determine the effect of term-weighting components on retrieval performance in a basic vector space scheme. Based on these tests, we identify a novel factor, the effective level of term frequency (EL), that represents document content based on the document's length and maximum term frequency. This factor is used to find the principal terms within each document and an appropriate subset of documents containing the query terms. Our proposed method, Top-Term Ranking, uses a reduced index of the original terms, in which only the principal terms of each document are considered for weighting. Our experiments on TREC collections show that EL is a significant factor in retrieving relevant documents, especially in large collections. The Top-Term Ranking method aims to improve the performance of large-scale information retrieval systems beyond that of common vector space methods.
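To make the idea of a reduced index concrete, the following minimal Python sketch ranks documents with a basic TF-IDF vector space model and optionally keeps only a few terms per document. It is an illustration under stated assumptions, not the paper's method: the abstract does not give the EL formula, so the `top_k` cutoff by raw term frequency is a hypothetical stand-in for principal-term selection, and the function names, whitespace tokenization, and smoothed IDF variant are likewise assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs, top_k=None):
    """Build TF-IDF vectors; optionally index only each document's top_k terms."""
    counts = []
    for doc in docs:
        tf = Counter(doc.lower().split())
        if top_k is not None:
            # Hypothetical stand-in for principal-term selection:
            # keep the top_k terms by raw frequency (NOT the paper's EL factor).
            tf = Counter(dict(tf.most_common(top_k)))
        counts.append(tf)
    n = len(docs)
    # Document frequency is computed over the (possibly reduced) index.
    df = Counter(term for tf in counts for term in tf)
    # Smoothed IDF, one of several common variants.
    idf = {t: math.log(1 + n / d) for t, d in df.items()}
    vectors = [{t: c * idf[t] for t, c in tf.items()} for tf in counts]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, docs, top_k=None):
    """Return document indices ordered by decreasing similarity to the query."""
    vectors, idf = tfidf_vectors(docs, top_k)
    q_tf = Counter(query.lower().split())
    q_vec = {t: c * idf[t] for t, c in q_tf.items() if t in idf}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    # Sort by score descending, breaking ties by document index.
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


docs = ["apple banana apple", "banana cherry", "cherry cherry date"]
full_ranking = rank("apple", docs)               # full index
reduced_ranking = rank("cherry", docs, top_k=1)  # one term kept per document
```

In the reduced run, the second document no longer matches "cherry" because only its first-ranked term is indexed. This shows the trade-off a reduced index makes: less to weight and score, at the risk of dropping terms that a query might need.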
