The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally withhold their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.
