A survey on session detection methods in query logs and a proposal for future evaluation

Search engine logs provide highly detailed insight into users' interactions, which makes them both extremely useful and sensitive. The datasets publicly available to scholars are, unfortunately, too few, too dated, and too small. They are few because search engine companies are reluctant to release such data; dated because they were collected in the late 1990s or early 2000s; and small because they comprise data for at most one day and just a few hundred thousand users. Even worse, the large query log disclosed by AOL in 2006 caused more harm than good because of a serious privacy flaw. In this paper the author provides an overview of the possible applications of query logs, the privacy concerns researchers must face when working on such datasets, and several ways in which query logs can be easily sanitized. One such measure consists of segmenting the logs into short topical sessions. Therefore, the author offers a comprehensive survey of session detection methods, as well as a thorough description of a new evaluation framework with performance results for each of the different methods. Additionally, a new and simple session detection method is proposed that outperforms the others. It is a heuristic technique that uses a geometric interpretation of both the time gap between queries and the similarity between them to flag a topic shift.
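
To make the idea of a geometric interpretation concrete, the following is a minimal sketch and not the method as specified in the paper, whose details the abstract does not give. It assumes that the time gap is mapped to a "closeness" value in [0, 1] over a hypothetical 24-hour horizon, that similarity is the Jaccard overlap of character trigrams, and that a topic shift is flagged when the point (closeness, similarity) falls inside the unit circle. All function names, the horizon, and the threshold are illustrative assumptions.

```python
import math


def char_ngrams(text, n=3):
    """Set of character n-grams of a query, padded with one space on each side."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(q1, q2, n=3):
    """Jaccard overlap of character n-grams (an assumed similarity measure)."""
    a, b = char_ngrams(q1, n), char_ngrams(q2, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_topic_shift(prev_query, query, gap_seconds, max_gap_seconds=24 * 3600):
    """Flag a shift when (closeness, similarity) lies inside the unit circle.

    closeness is 1.0 for a zero gap and decays linearly to 0.0 at
    max_gap_seconds (a hypothetical horizon, not taken from the paper).
    """
    closeness = max(0.0, 1.0 - gap_seconds / max_gap_seconds)
    return math.hypot(closeness, similarity(prev_query, query)) < 1.0


def segment(log):
    """Split a time-ordered list of (timestamp_seconds, query) into sessions."""
    sessions = []
    for ts, query in log:
        if sessions:
            prev_ts, prev_query = sessions[-1][-1]
            if not is_topic_shift(prev_query, query, ts - prev_ts):
                sessions[-1].append((ts, query))
                continue
        sessions.append([(ts, query)])
    return sessions
```

Under these assumptions, a query that follows quickly and shares most of its trigrams with the previous one stays in the same session, while a dissimilar query after a long pause opens a new one. Because closeness alone reaches 1.0 at a zero gap, this sketch never splits two queries issued at the same instant; a different combination rule would change that behaviour.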
