TEII: Topic enhanced inverted index for top-k document retrieval

Abstract: In recent years, topic modeling has gained significant momentum in information retrieval (IR). Researchers have found that combining the topic information generated by topic modeling with traditional TF-IDF information yields superior document retrieval results. However, applying this idea to real-life IR systems requires solving two critical problems: how to store the topic information, and how to use it together with the TF-IDF information for efficient document retrieval. In this paper, we propose the Topic Enhanced Inverted Index (TEII), which incorporates topic information into the inverted index for efficient top-k document retrieval. Specifically, we explore two types of TEII. We first propose the incremental TEII, which adds topic-based inverted lists to the traditional inverted index. The incremental TEII is beneficial for legacy IR systems, since it does not change the existing TF-IDF-based inverted lists. As a more flexible alternative, we propose the hybrid TEII, which incorporates the topic information into each posting of the inverted index. For the hybrid TEII, we propose two relaxation methods to support dynamic estimation of the upper-bound impact of each posting. The hybrid TEII is highly extensible to different ranking factors, and we show an extension that considers the static quality of the documents in the corpus. Based on the incremental and hybrid TEIIs, we develop several query processing algorithms to support efficient top-k document retrieval. Empirical evaluation on the TREC dataset verifies the effectiveness and efficiency of the proposed index structures and query processing algorithms.
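To make the incremental idea concrete, the following is a minimal Python sketch of an index that keeps the term-based inverted lists untouched and adds separate topic-based inverted lists next to them. It is an illustration under stated assumptions, not the paper's implementation: the class name `IncrementalTEII`, the methods `add_document` and `top_k`, the mixing weight `lam`, and the linear combination of TF-IDF and topic scores are all hypothetical, and the exhaustive scoring loop stands in for the pruning-based query processing algorithms that the abstract mentions but does not specify.

```python
import heapq
from collections import defaultdict


class IncrementalTEII:
    """Toy index: term-based inverted lists plus separate topic-based lists.

    All names and the scoring formula are assumptions made for illustration.
    """

    def __init__(self):
        # term -> {doc_id: TF-IDF weight}; stands in for the existing lists
        self.term_lists = defaultdict(dict)
        # topic_id -> {doc_id: P(topic | doc)}; the lists added alongside
        self.topic_lists = defaultdict(dict)

    def add_document(self, doc_id, term_weights, topic_probs):
        # Term postings are written exactly as in a plain inverted index.
        for term, weight in term_weights.items():
            self.term_lists[term][doc_id] = weight
        # Topic postings go into their own lists, leaving term lists unchanged.
        for topic, prob in topic_probs.items():
            self.topic_lists[topic][doc_id] = prob

    def top_k(self, query_terms, query_topics, k, lam=0.5):
        """Return the k highest-scoring (doc_id, score) pairs.

        Assumed score: (1 - lam) * sum of TF-IDF weights of matching terms
        plus lam * sum of query_topic_weight * P(topic | doc).
        """
        scores = defaultdict(float)
        for term in query_terms:
            for doc_id, weight in self.term_lists.get(term, {}).items():
                scores[doc_id] += (1 - lam) * weight
        for topic, q_weight in query_topics.items():
            for doc_id, prob in self.topic_lists.get(topic, {}).items():
                scores[doc_id] += lam * q_weight * prob
        # Exhaustive scoring followed by a heap-based selection of the top k.
        return heapq.nlargest(k, scores.items(), key=lambda kv: (kv[1], kv[0]))


index = IncrementalTEII()
index.add_document("d1", {"index": 2.0, "topic": 1.0}, {0: 0.9, 1: 0.1})
index.add_document("d2", {"index": 1.0}, {0: 0.2, 1: 0.8})
index.add_document("d3", {"retrieval": 3.0}, {1: 1.0})
result = index.top_k(["index"], {0: 1.0}, k=2)
print(result)
```

In this toy corpus the query term "index" and topic 0 both favor `d1`, so it ranks ahead of `d2`, while `d3` matches neither and receives no score. Because the topic lists are a separate structure, a system that already has term lists would only need to build and merge the additional lists, which is the property the abstract highlights for legacy IR systems.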
