Query-based unsupervised learning for improving social media search

In the current information era, social media has become an essential information source for users. Although text is the primary form in which this information is represented, finding relevant content remains a challenge for researchers because social media text is short and sparse. Acquiring high-quality search results from massive data such as social media requires a set of representative query terms, which are not always available. In this paper, we propose a novel query-based unsupervised learning model to represent the implicit relationships in short text from social media. The model compensates for the lack of word co-occurrences without requiring many parameters to be estimated or external evidence to be collected. To confirm its effectiveness, we compare the proposed model with state-of-the-art lexical, topic and temporal models on the large-scale TREC Microblog 2011-2014 collections. The experimental results show that the proposed model significantly improves over all of these state-of-the-art models, with a maximum increase of 33.97% in MAP and 21.38% in precision at the top 30 documents (P@30). The proposed model can potentially improve social media search effectiveness in closely related retrieval tasks, such as question answering and timeline summarisation.
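For readers unfamiliar with the two reported measures, the following is a minimal sketch of how MAP, precision at the top 30 documents, and a percentage increase over a baseline are conventionally computed. It is not the authors' evaluation code; the function names, the toy document identifiers, and the data layout (a ranked list of document ids per query plus a set of relevant ids) are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 30) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank holding a relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Dividing by all relevant documents penalises relevant documents never retrieved.
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """MAP: average precision averaged over all queries in the run."""
    scores = [average_precision(runs[q], qrels.get(q, set())) for q in runs]
    return sum(scores) / len(scores)


def percentage_increase(baseline: float, proposed: float) -> float:
    """Relative improvement of the proposed score over the baseline, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Toy example with hypothetical document ids (not data from the paper).
toy_ranked = ["d1", "d2", "d3", "d4"]
toy_relevant = {"d1", "d3"}
toy_ap = average_precision(toy_ranked, toy_relevant)
toy_p2 = precision_at_k(toy_ranked, toy_relevant, k=2)
```

In this toy case the relevant documents sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2 and the precision at 2 is 1/2. A figure such as "33.97% based on MAP" would then correspond to `percentage_increase` applied to a baseline MAP and the proposed model's MAP.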
