Building simulated queries for known-item topics: an analysis using six European languages

There has been increased interest in the use of simulated queries for evaluation and estimation purposes in Information Retrieval. However, many issues regarding their usage and impact on evaluation remain unaddressed, because their quality, in terms of retrieval performance, is unlike that of real queries. In this paper, we focus on methods for building simulated known-item topics and examine their quality against real known-item topics. Using existing generation models as our starting point, we explore factors that may influence the generation of the known-item topic. Informed by this detailed analysis (on six European languages), we propose a model with improved document and term selection properties, showing that simulated known-item topics can be generated that are comparable to real known-item topics. This is a significant step towards validating the potential usefulness of simulated queries: both for evaluation purposes, and because building models of querying behavior provides deeper insight into the querying process, so that better retrieval mechanisms can be developed to support the user.
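To make the generation process concrete, below is a minimal Python sketch of the generic known-item topic generation framework used as a starting point: select a target document, fix a query length, then sample query terms from a model of that document. The uniform document prior, the mixture weight `lam`, and the function name are illustrative assumptions, not the improved document and term selection model proposed in the paper.

```python
import random
from collections import Counter

def simulate_known_item_query(collection, query_length=3, lam=0.8, seed=None):
    """Generate one simulated known-item query.

    `collection` maps document ids to lists of tokens. The steps follow the
    generic framework: (1) select a target document, (2) fix a query length,
    (3) repeatedly sample terms from a model of the chosen document.
    """
    rng = random.Random(seed)

    # Background (collection) language model statistics.
    coll_counts = Counter(t for tokens in collection.values() for t in tokens)
    coll_total = sum(coll_counts.values())

    # Step 1: select the known item; uniform selection is an assumption here,
    # other document priors (e.g. length- or popularity-based) are possible.
    doc_id = rng.choice(list(collection))
    doc_counts = Counter(collection[doc_id])
    doc_total = sum(doc_counts.values())

    # Step 2: the query length is a fixed parameter in this sketch; the
    # framework normally draws it from an empirical length distribution.

    # Step 3: sample terms from a mixture of the document model and the
    # collection model, mimicking a user's imperfect recall of the document.
    vocab = list(doc_counts)
    weights = [
        lam * (doc_counts[t] / doc_total)
        + (1 - lam) * (coll_counts[t] / coll_total)
        for t in vocab
    ]
    query = rng.choices(vocab, weights=weights, k=query_length)
    return doc_id, query


# Example: a toy two-document collection.
docs = {
    "d1": "the museum opens on sunday afternoon".split(),
    "d2": "train timetable between amsterdam and utrecht".split(),
}
print(simulate_known_item_query(docs, query_length=2, seed=7))
```

Varying the document prior and the term sampling distribution in this sketch corresponds to the document and term selection properties analysed in the paper.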
