Microblog Retrieval Using Ensemble of Feature Sets through Supervised Feature Selection

Microblogs, especially Twitter, have become an integral part of daily life for finding the latest news and event information. Because tweets are short and frequently use unconventional abbreviations, search based on content relevance alone cannot satisfy users' information needs. Recent research has shown that taking temporal and contextual aspects into account improves retrieval performance significantly. In this paper, we focus on microblog retrieval, emphasizing the alleviation of vocabulary mismatch and the use of the temporal (e.g., recency and burstiness) and contextual characteristics of tweets. To address the temporal and contextual aspects of tweets, we propose new features based on query-tweet time, word embeddings, and query-tweet sentiment correlation. We also introduce popularity features to estimate the importance of a tweet. A three-stage query expansion technique is applied to improve the relevance of retrieved tweets. Moreover, we introduce query type determination techniques to determine the temporal and sentiment sensitivity of a query. After supervised feature selection, we apply random forest as a feature ranking method to estimate the importance of the selected features. We then use an ensemble learning to rank (L2R) framework to estimate the relevance of each query-tweet pair. We conducted experiments on the TREC Microblog 2011 and 2012 test collections over the TREC Tweets2011 corpus. Experimental results demonstrate the effectiveness of our method over the baseline and known related work in terms of precision at 30 (P@30), mean average precision (MAP), normalized discounted cumulative gain at 30 (NDCG@30), and R-precision (R-Prec).

Key words: microblog search, temporal information retrieval, query expansion, feature selection, learning to rank, time-aware ranking
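For readers unfamiliar with the evaluation metrics named above, the following is a minimal sketch of how they are commonly computed for a single query. It is not the evaluation code used in the paper (TREC runs are normally scored with the official tooling), and the function names, the input format (a list of graded relevance labels in ranked order), and the linear-gain form of DCG are all assumptions made for illustration. MAP is simply the mean of `average_precision` over all queries.

```python
import math


def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (label > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k


def average_precision(rels, num_relevant):
    """Mean of the precision values at each rank where a relevant item occurs.

    num_relevant is the total number of relevant items for the query,
    including any that were not retrieved.
    """
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant


def r_precision(rels, num_relevant):
    """Precision at rank R, where R is the number of relevant items."""
    if num_relevant == 0:
        return 0.0
    return sum(1 for r in rels[:num_relevant] if r > 0) / num_relevant


def ndcg_at_k(rels, k, judged_rels=None):
    """NDCG@k with linear gain (an assumption; other gain functions exist).

    judged_rels holds the labels of all judged items for the query and is
    used to build the ideal ranking; it defaults to the retrieved labels.
    """
    def dcg(gains):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

    pool = judged_rels if judged_rels is not None else rels
    ideal_dcg = dcg(sorted(pool, reverse=True)[:k])
    return dcg(rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranked list for one query: 1 = relevant, 0 = non-relevant.
ranked_labels = [1, 0, 1, 0]
p2 = precision_at_k(ranked_labels, 2)
ap = average_precision(ranked_labels, num_relevant=2)
rp = r_precision(ranked_labels, num_relevant=2)
ndcg4 = ndcg_at_k(ranked_labels, 4)
```

In the paper's setting the cutoff would be k = 30 for both P@30 and NDCG@30; the short list here only keeps the example easy to check by hand.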
