CEDAR: The Dutch historical censuses as Linked Open Data

In this document, we describe the CEDAR dataset, a five-star Linked Open Data representation of the Dutch historical censuses, which were conducted in the Netherlands once every 10 years from 1795 to 1971. We produce a linked dataset from a digitized sample of 2,288 tables. The dataset contains more than 6.8 million statistical observations about the demography, labour and housing of Dutch society in the 18th, 19th and 20th centuries. The dataset is modeled using the RDF Data Cube vocabulary for multidimensional data, uses Open Annotation to express data harmonization rules, and uses PROV to track the provenance of every data point and its transformations. We link these observations to well-known standard classification systems in social history, such as the Historical International Standard Classification of Occupations (HISCO) and the Amsterdamse Code (AC), which in turn link to DBpedia and GeoNames. The two main contributions of the dataset are improved data integration and access for historical research, and the emergence of new historical data hubs in the Linked Open Data cloud, such as classifications of historical religions and historical house types.
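To make the modeling choices above concrete, the following is a minimal sketch of how a single census count might be expressed as an RDF Data Cube observation with a PROV link back to its source table cell. It is an illustration under stated assumptions, not the dataset's actual schema: the `http://example.org/census/` namespace, the `population` measure property, the identifiers and the value 1234 are all hypothetical. Only the `qb:`, `prov:`, `rdf:` and `xsd:` namespaces are the real W3C ones.

```python
# Namespaces of the W3C vocabularies named in the abstract.
QB = "http://purl.org/linked-data/cube#"
PROV = "http://www.w3.org/ns/prov#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"
# Hypothetical namespace for illustration only; not the dataset's real URIs.
EX = "http://example.org/census/"


def observation_triples(obs_id, dataset_id, source_cell, population):
    """Return N-Triples lines describing one statistical observation."""
    obs = f"<{EX}observation/{obs_id}>"
    return [
        # The resource is a Data Cube observation.
        f"{obs} <{RDF}type> <{QB}Observation> .",
        # It belongs to a dataset (here: one digitized census table).
        f"{obs} <{QB}dataSet> <{EX}dataset/{dataset_id}> .",
        # Hypothetical measure property carrying the counted value.
        f'{obs} <{EX}population> "{population}"^^<{XSD}integer> .',
        # Provenance: the observation derives from a specific table cell.
        f"{obs} <{PROV}wasDerivedFrom> <{EX}cell/{source_cell}> .",
    ]


# Made-up identifiers and count, purely for demonstration.
triples = observation_triples("o1", "table-1", "sheet0-B12", 1234)
print("\n".join(triples))
```

Plain strings are used here to keep the sketch dependency-free; in practice an RDF library would handle serialization and escaping.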
