Identifying task-based sessions in search engine query logs

The research challenge addressed in this paper is to devise effective techniques for identifying task-based sessions, i.e., sets of possibly non-contiguous queries issued by the user of a Web search engine to carry out a given task. To evaluate and compare different approaches, we built a ground truth by manually labeling a query log, grouping its queries into tasks. Our analysis of this ground truth shows that users tend to perform more than one task at the same time: about 75% of the submitted queries involve a multi-tasking activity. We formally define the Task-based Session Discovery Problem (TSDP) as the problem of best approximating the manually annotated tasks. To solve the TSDP, we propose several variants of well-known clustering algorithms, as well as a novel, efficient heuristic algorithm specifically tuned for the problem. These algorithms also exploit the collaborative knowledge collected by Wiktionary and Wikipedia to detect query pairs that are lexically dissimilar but semantically related. The proposed algorithms have been evaluated on the above ground truth and are shown to perform better than state-of-the-art approaches, because they effectively take into account the multi-tasking behavior of users.
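To make the idea of grouping possibly non-contiguous queries concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes a purely lexical similarity (Jaccard overlap of character trigrams), an arbitrary threshold of 0.3, and tasks defined as connected components of the resulting similarity graph. The function names `trigrams`, `jaccard`, and `group_into_tasks` are hypothetical, and the Wiktionary/Wikipedia-based semantic similarity mentioned above is left out.

```python
from itertools import combinations


def trigrams(query):
    """Character 3-grams of a lowercased query, padded with spaces."""
    text = "  " + query.lower().strip() + " "
    return {text[i:i + 3] for i in range(len(text) - 2)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_into_tasks(queries, threshold=0.3):
    """Group query indices into tasks.

    Assumption for illustration: two queries belong to the same task
    when they are linked by a chain of pairs whose trigram Jaccard
    similarity is at least `threshold` (connected components).
    """
    parent = list(range(len(queries)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    grams = [trigrams(q) for q in queries]
    # Compare every pair, so queries need not be adjacent in the log.
    for i, j in combinations(range(len(queries)), 2):
        if jaccard(grams[i], grams[j]) >= threshold:
            parent[find(i)] = find(j)

    tasks = {}
    for i in range(len(queries)):
        tasks.setdefault(find(i), []).append(i)
    return sorted(tasks.values())


# Two interleaved tasks from one hypothetical user.
log = [
    "cheap flights rome",
    "python list sort",
    "cheap flights to rome italy",
    "python sort list reverse",
]
print(group_into_tasks(log))
```

Because every pair of queries is compared rather than only adjacent ones, the interleaved travel and programming queries end up in separate groups, which is the multi-tasking behavior that a purely time-ordered segmentation would miss.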
