An Embedding Based IR Model for Disaster Situations

Twitter (http://twitter.com) is one of the most popular social networking platforms. Its users can easily broadcast disaster-specific information which, if mined effectively, can assist relief operations. However, the brevity and informal nature of tweets pose a challenge to Information Retrieval (IR) researchers. In this paper, we use word embedding techniques to improve ranking for ad-hoc queries on microblog data. Our experiments with the ‘Social Media for Emergency Relief and Preparedness’ (SMERP) dataset, provided at an ECIR 2017 workshop, show that these techniques outperform conventional term-matching-based IR models. In addition, we show that, for the SMERP task, our word-embedding-based method is more effective when the embeddings are trained on the disaster-specific SMERP data than when they are trained on the large social media collection provided for the TREC (http://trec.nist.gov/) 2011 Microblog track.
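To illustrate why embeddings can help where term matching fails, the following is a minimal sketch of one common embedding-based ranking scheme: each text is represented by the average of its word vectors, and tweets are ranked by cosine similarity to the query. This is not necessarily the model used in the paper. The toy vectors, the `EMBEDDINGS` table, and the helper names `embed`, `cosine`, and `rank` are all assumptions made for illustration; in practice the vectors would be trained on a tweet collection.

```python
import math

# Hypothetical 3-dimensional embeddings; the values are invented for
# illustration and are not taken from any trained model.
EMBEDDINGS = {
    "flood": [0.9, 0.1, 0.0],
    "water": [0.8, 0.2, 0.1],
    "rescue": [0.1, 0.9, 0.1],
    "help": [0.2, 0.8, 0.0],
    "concert": [0.0, 0.1, 0.9],
    "tickets": [0.1, 0.0, 0.8],
}


def embed(text):
    """Average the vectors of in-vocabulary words; None if no word is known."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, tweets):
    """Return (score, tweet) pairs sorted by descending similarity to the query."""
    q = embed(query)
    scored = []
    for tweet in tweets:
        v = embed(tweet)
        # Texts with no known words get a score of 0.0.
        score = cosine(q, v) if q is not None and v is not None else 0.0
        scored.append((score, tweet))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


tweets = [
    "need water urgently",
    "rescue team please help",
    "concert tickets available",
]
ranking = rank("flood", tweets)
for score, tweet in ranking:
    print(f"{score:.3f}  {tweet}")
```

In this toy setup the query "flood" shares no term with "need water urgently", so a pure term-matching model would give that tweet no credit, whereas the embedding similarity still places it first.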
