Enhancing web search by mining search and browse logs

Search engines have accumulated huge amounts of search log data. A commercial search engine currently receives billions of queries and collects terabytes of log data every day. Beyond search logs, browse logs can be collected by client-side browser plug-ins, which record browsing information when users grant permission. On the one hand, such massive amounts of search and browse log data provide great opportunities to mine the wisdom of crowds and improve web search results. On the other hand, designing effective and efficient methods to clean, model, and process large-scale log data presents great challenges. In this tutorial, we focus on mining search and browse log data for search engines. We start with an introduction to search and browse log data and an overview of frequently used data summarizations in log mining. We then elaborate on how log mining applications enhance the five major components of a search engine, namely query understanding, document understanding, query-document matching, user understanding, and monitoring and feedback. For each aspect, we survey the major tasks, fundamental principles, and state-of-the-art methods. Finally, we discuss the challenges and future trends of log data mining. The goal of this tutorial is to provide the IR community with a systematic survey of large-scale search and browse log mining. It may help IR researchers become familiar with the core challenges and promising directions in log mining. At the same time, it may serve developers of web information retrieval systems as a comprehensive and in-depth reference to advanced log mining techniques.
