Improving Arabic Microblog Retrieval with Distributed Representations

Query expansion (QE) using pseudo-relevance feedback (PRF) is one of the approaches shown to be effective for improving microblog retrieval. In this paper, we investigate the performance of three embedding-based methods for Arabic microblog retrieval: embedding-based QE, embedding-based PRF, and PRF combined with embedding-based reranking. Our experiments over three variants of the EveTAR test collection show a consistent improvement of the reranking method over the traditional PRF baseline on both MAP and P@10 evaluation measures; the improvement is statistically significant in some cases. While embedding-based QE fails to improve over traditional PRF, embedding-based PRF outperforms the baseline in several cases, with a statistically significant improvement in MAP over two variants of the test collection.
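To make the two embedding-based ideas concrete, here is a minimal, self-contained sketch of (a) embedding-based query expansion and (b) embedding-based reranking, both built on cosine similarity between word-vector centroids. The vocabulary, vectors, and function names are illustrative assumptions, not the paper's actual models (which use pretrained distributed representations over Arabic tweets).

```python
import math

# Toy word-vector table standing in for pretrained embeddings;
# these English terms and 3-d vectors are purely illustrative.
EMBEDDINGS = {
    "election": (0.90, 0.10, 0.00),
    "vote":     (0.85, 0.15, 0.05),
    "ballot":   (0.80, 0.20, 0.10),
    "weather":  (0.00, 0.10, 0.95),
    "rain":     (0.05, 0.05, 0.90),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def centroid(terms):
    # Average the vectors of the terms we have embeddings for.
    vecs = [EMBEDDINGS[t] for t in terms if t in EMBEDDINGS]
    if not vecs:
        return None
    return tuple(sum(col) / len(vecs) for col in zip(*vecs))

def expand_query(query_terms, k=2):
    """Embedding-based QE: append the k vocabulary terms closest
    to the centroid of the query's word vectors."""
    c = centroid(query_terms)
    if c is None:
        return list(query_terms)
    cands = [(t, cosine(c, v)) for t, v in EMBEDDINGS.items()
             if t not in query_terms]
    cands.sort(key=lambda x: x[1], reverse=True)
    return list(query_terms) + [t for t, _ in cands[:k]]

def rerank(query_terms, docs):
    """Embedding-based reranking: reorder candidate tweets
    (given as term lists) by cosine similarity between the
    query centroid and each tweet's centroid."""
    qc = centroid(query_terms)
    return sorted(docs, key=lambda d: cosine(qc, centroid(d)),
                  reverse=True)

print(expand_query(["election"]))
print(rerank(["election"], [["rain", "weather"], ["vote", "ballot"]]))
```

In a PRF setting, the expansion candidates would come from the top-retrieved tweets rather than the whole vocabulary, and the reranking step would be applied to the result list produced by the PRF baseline.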
