Search Result Diversification in Short Text Streams

We consider the problem of search result diversification for streams of short texts. Diversifying search results in short text streams is more challenging than for long documents, because the latent topics of short documents are difficult to capture. To capture how topics change over time, and the probabilities of documents for a given query at a specific time in a short text stream, we propose a dynamic Dirichlet multinomial mixture topic model, called D2M3, together with a Gibbs sampling algorithm for inference. We also propose a streaming diversification algorithm, SDA, that integrates the information captured by D2M3 with a modified version of the PM-2 (Proportionality-based diversification Method, second version) diversification algorithm. We conduct experiments on a Twitter dataset and find that SDA statistically significantly outperforms state-of-the-art non-streaming retrieval methods, plain streaming retrieval methods, and streaming diversification methods that use other dynamic topic models.
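For readers unfamiliar with PM-2, the following is a minimal sketch of the original (unmodified) PM-2 reranking procedure as it is commonly described: topics compete for ranking positions via a quotient of their weight over the seats they already hold, and each selected document fills seats in proportion to its topic probabilities. It is not the modified version used in SDA, which the abstract does not specify. The function name `pm2_rerank`, the input dictionaries `rel` (standing in for per-topic document probabilities such as those a topic model like D2M3 might supply) and `weights`, and the trade-off parameter `lam` are all assumptions made for illustration.

```python
from typing import Dict, List


def pm2_rerank(
    rel: Dict[str, Dict[str, float]],   # hypothetical: rel[doc][topic] = P(doc | topic)
    weights: Dict[str, float],          # hypothetical: topic weight ("votes") for the query
    k: int,
    lam: float = 0.5,                   # assumed trade-off between the chosen topic and the rest
) -> List[str]:
    """Sketch of original PM-2: proportional, seat-based greedy reranking."""
    seats = {t: 0.0 for t in weights}   # portion of ranking positions held by each topic
    remaining = set(rel)
    ranking: List[str] = []

    while remaining and len(ranking) < k:
        # Sainte-Lague style quotient: topics holding fewer seats get priority.
        quotient = {t: weights[t] / (2.0 * seats[t] + 1.0) for t in weights}
        best_topic = max(quotient, key=quotient.get)

        def score(doc: str) -> float:
            # Reward relevance to the chosen topic, plus a discounted
            # contribution from relevance to all other topics.
            main = quotient[best_topic] * rel[doc].get(best_topic, 0.0)
            rest = sum(
                quotient[t] * rel[doc].get(t, 0.0)
                for t in weights
                if t != best_topic
            )
            return lam * main + (1.0 - lam) * rest

        # sorted() makes tie-breaking deterministic.
        best_doc = max(sorted(remaining), key=score)
        ranking.append(best_doc)
        remaining.remove(best_doc)

        # The selected document occupies seats in proportion to its topic probabilities.
        total = sum(rel[best_doc].get(t, 0.0) for t in weights)
        if total > 0:
            for t in weights:
                seats[t] += rel[best_doc].get(t, 0.0) / total

    return ranking
```

With two equally weighted topics and two documents that cover only the first topic, the sketch places a document about the second topic at rank two, even though its raw probability is lower, because the first topic already holds a seat.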
