ORCA - a Benchmark for Data Web Crawlers

The number of RDF knowledge graphs available on the Web grows constantly. Gathering these graphs at large scale for downstream applications hence requires the use of crawlers. Although Data Web crawlers exist, and general Web crawlers could be adapted to focus on the Data Web, there is currently no benchmark to fairly evaluate their performance. Our work closes this gap by presenting the Orca benchmark. Orca generates a synthetic Data Web, which is decoupled from the original Web and enables a fair and repeatable comparison of Data Web crawlers. Our evaluations show that Orca can be used to reveal the different advantages and disadvantages of existing crawlers. The benchmark is open-source and available at https://w3id.org/dice-research/orca.
