Full-Text and URL Search Over Web Archives

Web archives are a historically valuable source of information. In some respects, web archives are the only record of the evolution of human society in the last two decades. They preserve a mix of personal and collective memories, the importance of which tends to grow as they age. However, the value of web archives depends on their users being able to search and access the information they require in efficient and effective ways. Without the possibility of exploring and exploiting the archived contents, web archives are useless. Web archive access functionalities range from basic browsing to advanced search and analytical services, accessed through user-friendly interfaces. Full-text and URL search have become the predominant and preferred forms of information discovery in web archives, fulfilling user needs and supporting search APIs that feed complex applications. Both full-text and URL search are based on the technology developed for modern web search engines, since the Web is the main resource targeted by both systems. However, while web search engines enable searching over the most recent web snapshot, web archives enable searching over multiple snapshots from the past. This means that web archives have to deal with a temporal dimension that gives rise to new challenges and opportunities, which are discussed throughout this chapter.

Miguel Costa, Vodafone Research, e-mail: miguel.costa2@vodafone.com
