Tweet Contextualization: a Strategy Based on Document Retrieval Using Query Enrichment and Automatic Summarization

The aim of the INEX (Initiative for the Evaluation of XML Retrieval) tweet contextualization task at CLEF 2013 (Conference and Labs of the Evaluation Forum) is to build a system that automatically provides information related to a given tweet, that is, a summary that explains that tweet. In this article, we present our strategy and results. The methodology for the English task has three stages. First, the initial queries provided for the task, that is, the tweets, are automatically reformulated. For this reformulation we use word sequences that match typical terminological patterns, named entities, hashtags and Twitter user accounts, since we consider them representative of the topics of the tweets. Second, related documents are retrieved from Wikipedia with the Indri search engine, using the reformulated queries. Third, the retrieved documents are summarized with two different automatic summarization systems, REG and Cortex, in order to provide the final summary associated with each query. For the Spanish pilot task, our strategy starts with the same kind of automatic query reformulation as for English. However, it uses neither the Indri search engine nor the summarization systems REG and Cortex. Instead, we directly extract relevant text passages from Wikipedia pages using the generated queries, and we build the summary from the first sentences of these pages.
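The query reformulation stage can be illustrated with a minimal sketch. It covers only two of the signals named above, hashtags and Twitter user accounts, and leaves out terminological patterns and named entities. The function names, the camel-case splitting of hashtags, and the way the terms are combined are assumptions made for illustration, not the authors' implementation. The only external convention used is the Indri query language, where `#combine(...)` groups terms and `#1(...)` requests an exact ordered phrase.

```python
import re

# Hypothetical helpers: these names are illustrative, not from the paper.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def split_hashtag(tag):
    """Split a hashtag body into words, e.g. 'MarsRover' -> ['Mars', 'Rover'].

    Assumption: camel-case boundaries and digit runs mark word boundaries.
    """
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", tag)


def as_indri_term(words):
    """Render one or more words as a single Indri term or exact phrase."""
    if len(words) == 1:
        return words[0]
    return "#1(" + " ".join(words) + ")"


def build_indri_query(tweet):
    """Build an Indri query from the hashtags and user accounts in a tweet.

    Returns an empty string when the tweet contains neither.
    """
    parts = []
    for tag in HASHTAG.findall(tweet):
        words = [w.lower() for w in split_hashtag(tag)]
        if words:
            parts.append(as_indri_term(words))
    for account in MENTION.findall(tweet):
        parts.append(account.lower())
    if not parts:
        return ""
    return "#combine(" + " ".join(parts) + ")"


if __name__ == "__main__":
    print(build_indri_query("Great talk by @NASA on #MarsRover today"))
```

In a fuller system, the terms found by a term extractor and a named-entity recognizer would be appended to `parts` in the same way before the query is sent to the search engine.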
