Query-Based Automatic Training Set Selection for Microblog Retrieval

Typical pseudo-relevance feedback models assume that the top-ranked documents from the first-pass retrieval are relevant and use them to select feedback terms for query expansion. In real applications, however, short documents such as microblogs may not contain enough information about the searched topic, which increases the chance that irrelevant documents appear in the initial retrieved set. This reduces a feedback model’s ability to capture information relevant to users’ needs, so determining the best documents for relevance feedback without requiring extra effort from the user is a critical challenge. In this paper, we propose a mechanism that automatically selects useful feedback documents using a topic modeling technique to improve the effectiveness of pseudo-relevance feedback models. The main idea is to discover the latent topics in the top-ranked documents, which allows the model to exploit the correlation between terms in relevant topics. To capture discriminative terms for query expansion, we incorporate topical features into a relevance model that focuses on the temporal information in the selected set of documents. Experimental results on the TREC 2011–2013 microblog datasets show that the proposed model significantly outperforms all state-of-the-art baseline models.
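The general idea of topic-based feedback document selection followed by relevance-model term weighting can be illustrated with a minimal sketch. This is not the paper's actual model: the function names, the 0.5 threshold, the toy documents, and the selection rule (keep documents dominated by the topic with the largest score-weighted mass) are all assumptions made for illustration. The per-document topic distributions are assumed to be precomputed, for example by LDA, and the temporal component mentioned in the abstract is omitted.

```python
from collections import Counter

def select_feedback_docs(doc_topics, scores, threshold=0.5):
    """Return indices of documents dominated by the strongest topic.

    doc_topics: per-document topic distributions (assumed precomputed).
    scores: first-pass retrieval scores, one per document.
    threshold: hypothetical cut-off on the topic probability.
    """
    n_topics = len(doc_topics[0])
    # Score-weighted mass of each topic over the top-ranked documents.
    mass = [sum(s * theta[k] for theta, s in zip(doc_topics, scores))
            for k in range(n_topics)]
    best = max(range(n_topics), key=mass.__getitem__)
    return [i for i, theta in enumerate(doc_topics) if theta[best] >= threshold]

def expansion_terms(docs, scores, selected, query_terms, n_terms=3):
    """Rank expansion terms with a simple relevance-model style estimate.

    Each term's weight is the sum over selected documents of the
    normalized retrieval score times the term's relative frequency.
    """
    total = sum(scores[i] for i in selected)
    weights = Counter()
    for i in selected:
        tokens = docs[i].lower().split()
        for term, count in Counter(tokens).items():
            weights[term] += (scores[i] / total) * (count / len(tokens))
    # Original query terms are not useful as expansion terms.
    for term in query_terms:
        weights.pop(term, None)
    return [term for term, _ in weights.most_common(n_terms)]

# Toy first-pass results for the query "japan earthquake" (made-up data).
docs = [
    "earthquake japan tsunami warning",
    "japan earthquake relief donation",
    "earthquake pizza party tonight",
    "new phone release date",
]
doc_topics = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
scores = [3.0, 2.5, 2.0, 1.0]

selected = select_feedback_docs(doc_topics, scores)
terms = expansion_terms(docs, scores, selected, ["japan", "earthquake"])
```

In this toy run the third document matches a query word but belongs mostly to the other topic, so it is excluded from the feedback set and its terms never reach the expanded query. A conventional pseudo-relevance feedback model would have used it simply because it ranked in the top results.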
