To help web applications understand the content of HTML pages, an increasing number of websites have started to annotate structured data within their pages using markup formats such as Microdata, RDFa, and Microformats. Google, Yahoo!, Yandex, Bing, and Facebook use these annotations to enrich search results and to display entity descriptions within their applications. In this paper, we present a series of publicly accessible Microdata, RDFa, and Microformats datasets that we have extracted from three large web corpora dating from 2010, 2012, and 2013. Altogether, the datasets consist of almost 30 billion RDF quads. The most recent dataset contains, among other data, over 211 million product descriptions, 54 million reviews, and 125 million postal addresses originating from thousands of websites. The availability of the datasets lays the foundation for further research on integrating and cleansing the data as well as for exploring its utility in different application contexts. As the dataset series spans four years, it can also be used to analyze how adoption of the markup formats has evolved.