Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing

Document-centric static index pruning methods provide smaller indexes and faster query times by dropping some within-document term information from inverted lists. We present a method of pruning inverted lists derived from the formulation of unigram language models for retrieval. Our method is based on the statistical significance of term frequency ratios: using the two-sample two-proportion (2P2N) test, we statistically compare the frequency of occurrence of a word within a given document to the frequency of its occurrence in the collection to decide whether to prune it. Experimental results show that this technique can be used to significantly decrease the size of the index and the query time with less compromise to retrieval effectiveness than similar heuristic methods. Furthermore, we give a formal statistical justification for such methods.
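
The pruning decision described above can be illustrated with a minimal sketch. It assumes the textbook pooled two-proportion z-statistic and a one-sided critical value of 1.645; the exact form of the test and the threshold used in the paper are not stated in this abstract, so both are assumptions. The names `two_proportion_z` and `prune_document` are hypothetical and chosen only for this example.

```python
import math


def two_proportion_z(tf, doc_len, cf, coll_len):
    """Pooled two-proportion z-statistic (assumed form of the test).

    Compares the term's relative frequency in the document (tf / doc_len)
    with its relative frequency in the collection (cf / coll_len).
    """
    p_doc = tf / doc_len
    p_coll = cf / coll_len
    pooled = (tf + cf) / (doc_len + coll_len)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / doc_len + 1.0 / coll_len))
    if se == 0.0:
        return 0.0  # degenerate case: no variance to test against
    return (p_doc - p_coll) / se


def prune_document(term_counts, coll_counts, coll_len, threshold=1.645):
    """Keep only terms whose in-document frequency is significantly higher
    than their collection frequency; all other postings are pruned.

    The threshold is an illustrative one-sided critical value, not a
    setting taken from the paper.
    """
    doc_len = sum(term_counts.values())
    return {
        term: tf
        for term, tf in term_counts.items()
        if two_proportion_z(tf, doc_len, coll_counts[term], coll_len) > threshold
    }


# Toy statistics (made up for illustration only).
doc = {"the": 5, "pruning": 5, "filler": 90}
collection = {"the": 50_000, "pruning": 100, "filler": 900_000}
kept = prune_document(doc, collection, coll_len=1_000_000)
print(kept)
```

In this toy example, a term that occurs in the document at the same rate as in the collection yields a statistic near zero and is dropped, while a term that is much more frequent in the document than in the collection is retained. Raising the threshold prunes more aggressively.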
