Building and querying semantic layers for web archives (extended version)

Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods remains a major hurdle to turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles (“layers”) that describe semantic information about the contents of web archives. A semantic layer allows describing metadata about the archived documents, annotating them with useful semantic information (such as entities, concepts, and events), and publishing all these data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems cannot sufficiently satisfy.
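To make the idea of querying a semantic layer more concrete, the following is a minimal sketch in plain Python rather than the RDF/S model or the framework proposed in the paper. It assumes a tiny in-memory set of subject-predicate-object triples; the property names (`ex:dateCaptured`, `ex:mentions`, `ex:birthPlace`), the document identifiers, and the sample data are all hypothetical and chosen only for illustration. The query combines document metadata, entity annotations, and facts about the entities, which is the kind of information need that keyword matching alone would struggle with, because the birthplace never has to appear in the document text.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical sample data: the identifiers and property names below are
# illustrative assumptions, not the vocabulary used in the paper.
TRIPLES: Set[Triple] = {
    # Document metadata and entity annotations (the "semantic layer").
    ("doc:1", "ex:dateCaptured", "2012-03-01"),
    ("doc:1", "ex:mentions", "ent:Barack_Obama"),
    ("doc:2", "ex:dateCaptured", "2013-07-15"),
    ("doc:2", "ex:mentions", "ent:Angela_Merkel"),
    ("doc:3", "ex:dateCaptured", "2012-11-07"),
    ("doc:3", "ex:mentions", "ent:Angela_Merkel"),
    # Facts about entities, assumed to come from a linked knowledge base.
    ("ent:Barack_Obama", "ex:birthPlace", "ent:Honolulu"),
    ("ent:Angela_Merkel", "ex:birthPlace", "ent:Hamburg"),
}


def objects(triples: Set[Triple], subject: str, predicate: str) -> Set[str]:
    """Return all objects o such that (subject, predicate, o) is a triple."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def docs_mentioning_born_in(triples: Set[Triple], place: str, year: str) -> List[str]:
    """Documents captured in `year` that mention an entity born in `place`."""
    matches = set()
    for doc, predicate, entity in triples:
        if predicate != "ex:mentions":
            continue
        # Join 1: the mentioned entity must have the requested birthplace.
        if place not in objects(triples, entity, "ex:birthPlace"):
            continue
        # Join 2: the document must have a capture date in the requested year.
        dates = objects(triples, doc, "ex:dateCaptured")
        if any(date.startswith(year) for date in dates):
            matches.add(doc)
    return sorted(matches)


if __name__ == "__main__":
    print(docs_mentioning_born_in(TRIPLES, "ent:Hamburg", "2012"))
```

In an actual deployment, such a question would more plausibly be expressed as a SPARQL query against the published Linked Data; the nested loops here only mimic the joins that such a query would perform.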
