Combining implicit and explicit topic representations for result diversification

Result diversification deals with ambiguous or multi-faceted queries by providing documents that cover as many subtopics of a query as possible. Various approaches to subtopic modeling have been proposed. Subtopics have been extracted internally, e.g., from retrieved documents, and externally, e.g., from Web resources such as query logs. Internally modeled subtopics are often implicitly represented, e.g., as latent topics, while externally modeled subtopics are often explicitly represented, e.g., as reformulated queries. We propose a framework that (i) combines both implicitly and explicitly represented subtopics, and (ii) allows flexible combination of multiple external resources in a transparent and unified manner. Specifically, we use a random-walk-based approach to estimate the similarities of the explicit subtopics mined from a number of heterogeneous resources: click logs, anchor text, and web n-grams. We then use these similarities to regularize the latent topics extracted from the top-ranked documents, i.e., the internal (implicit) subtopics. Empirical results show that regularization with explicit subtopics extracted from the right resource leads to improved diversification results, indicating that the proposed regularization with (explicit) external resources yields better (implicit) topic models. Click logs and anchor text are shown to be more effective resources than web n-grams under the current experimental settings. Combining resources does not always lead to better results, but it does yield robust performance. This robustness is important for two reasons: it cannot be predicted which resources will be most effective for a given query, and it is not yet known how to reliably determine the optimal model parameters for building implicit topic models.
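To make the random-walk step more concrete, the following is a minimal sketch of how similarities between explicit subtopics could be estimated from a click log. It is an illustration under stated assumptions, not the formulation used in the paper: the abstract does not specify the walk length, the graph construction, or the normalization. Here the graph is assumed to be bipartite (subtopic queries on one side, clicked URLs on the other), the walk is assumed to take exactly two steps (subtopic to URL and back to a subtopic), and the result is symmetrized by averaging. The function name `two_step_walk_similarity`, the example queries, and the click counts are all hypothetical.

```python
from collections import defaultdict


def normalize(weights):
    """Turn raw counts into a probability distribution (empty if all zero)."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()} if total else {}


def two_step_walk_similarity(clicks):
    """Estimate subtopic similarity with a two-step walk on a click graph.

    clicks maps each explicit subtopic (a reformulated query) to a dict of
    {url: click_count}. This input format is an assumption of the sketch.
    """
    # Step 1 transition probabilities: subtopic -> URL.
    forward = {subtopic: normalize(urls) for subtopic, urls in clicks.items()}

    # Step 2 transition probabilities: URL -> subtopic.
    by_url = defaultdict(dict)
    for subtopic, urls in clicks.items():
        for url, count in urls.items():
            by_url[url][subtopic] = count
    backward = {url: normalize(subs) for url, subs in by_url.items()}

    # Probability of starting at one subtopic and ending at another
    # after going subtopic -> URL -> subtopic.
    walk = {subtopic: defaultdict(float) for subtopic in clicks}
    for source, urls in forward.items():
        for url, p_source_url in urls.items():
            for target, p_url_target in backward[url].items():
                walk[source][target] += p_source_url * p_url_target

    # The walk is not symmetric in general, so average both directions
    # (a simplifying choice made for this sketch).
    similarity = {}
    for a in clicks:
        for b in clicks:
            similarity[(a, b)] = 0.5 * (walk[a].get(b, 0.0) + walk[b].get(a, 0.0))
    return similarity


# Hypothetical click data for the ambiguous query "jaguar".
clicks = {
    "jaguar car": {"jaguar.example": 3, "cars.example": 1},
    "jaguar price": {"jaguar.example": 1, "cars.example": 1},
    "jaguar animal": {"wildlife.example": 4},
}

sim = two_step_walk_similarity(clicks)
for pair in [("jaguar car", "jaguar price"), ("jaguar car", "jaguar animal")]:
    print(pair, round(sim[pair], 4))
```

In this toy data the two car-related subtopics share clicked URLs and so receive a positive similarity, while the animal subtopic shares none and receives zero. In a framework like the one described above, a similarity matrix of this kind would then act as the regularization signal for the latent topic model; other resources such as anchor text or web n-grams would need their own graph construction, which this sketch does not attempt.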
