Harmonizing Aggregate Historical Dutch Census Data: a Flexible Approach

Historical censuses are among the richest sources of statistical data about a nation’s past. They contain valuable data on population characteristics, housing and socio-economic conditions. Not only are they taken consistently over time, in our case for almost two centuries, they also cover the entire nation geographically. The challenges of using these data for studies over time and space are well known, documented and shared by many projects focusing on the harmonization of historical census data. Napoleonic influences during the Batavian Republic were responsible for the first versions of the historical Dutch censuses. A general enumeration was held in 1795; over thirty years later the first official census was introduced in the Netherlands, and censuses continued in their traditional ‘door to door’ form until 1971. To make this valuable source more accessible and available for study, digitization efforts were undertaken, which resulted in thousands of scans of the original census books. These images were later transcribed into Excel tables, making them computer processable but not solving the problem of harmonization. To address this problem we apply Semantic Web technologies, more specifically the Resource Description Framework (RDF), and propose a three-tier model to harmonize the Dutch historical censuses. We convert the data to RDF in such a way that all the peculiarities and hierarchies of the original tables are preserved. In other words, we provide a one-to-one representation of the Excel tables in RDF. Although the data are now available in one system (instead of over two thousand heterogeneous Excel files), they still have to be harmonized in order to allow comparisons across time and space. We acknowledge that even when using machine-readable formats and Semantic Web technologies, the data still have to be formally defined by ‘expert users’, i.e. the historians working with the data.
However, even for these users this is not a straightforward task. As we cannot claim to know upfront the ‘best’ harmonization model, we recognize the need for a flexible approach which allows users to build and test their harmonizations efficiently, especially when dealing only with aggregated data. Current literature does not provide enough insight into the practice of data harmonization when dealing with aggregated census data. We have created a multilevel ‘flexible’ workflow, consisting of a set of specific harmonization practices. We perform all of these transformations in RDF without compromising the underlying sources. Being able to provide direct links from the harmonized results to the original Excel tables and images is a key requirement of our model.
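
To make the idea of a non-destructive harmonization concrete, the following is a minimal sketch in plain Python, not the project's actual implementation. It represents one table cell as subject-predicate-object triples and then derives a harmonized statement that points back to the raw cell. The namespace `http://example.org/census/`, the property names (`value`, `sourceTable`, `header0`, `category`, `derivedFrom`), the table identifier `VT_1899_01` and the mapping entry are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

EX = "http://example.org/census/"  # hypothetical namespace, for illustration only


def raw_cell_triples(table: str, cell: str, value: int, headers: List[str]) -> List[Triple]:
    """Describe one spreadsheet cell, keeping its header hierarchy as found."""
    subject = f"{EX}raw/{table}/{cell}"
    triples: List[Triple] = [
        (subject, EX + "value", str(value)),
        (subject, EX + "sourceTable", table),
    ]
    # One triple per header level, so the original hierarchy is preserved.
    for depth, label in enumerate(headers):
        triples.append((subject, f"{EX}header{depth}", label))
    return triples


def harmonize(raw: List[Triple], mapping: Dict[str, str]) -> List[Triple]:
    """Return new triples for mapped headers; the raw triples are never modified."""
    derived: List[Triple] = []
    for subject, predicate, obj in raw:
        if predicate.startswith(EX + "header") and obj in mapping:
            target = subject.replace("/raw/", "/harmonized/")
            derived.append((target, EX + "category", mapping[obj]))
            # Link back to the raw cell so the result stays traceable.
            derived.append((target, EX + "derivedFrom", subject))
    return derived


# Hypothetical cell D12 of a hypothetical table, with two header levels.
raw = raw_cell_triples("VT_1899_01", "D12", 412, ["Mannen", "Landbouwers"])
# A user-defined mapping that can be revised and re-run at any time.
mapping = {"Landbouwers": "occupation/farmer"}
derived = harmonize(raw, mapping)
```

Because the mapping is kept separate from the raw triples, an expert user could change it and regenerate `derived` without touching the source layer, which is the kind of flexibility the workflow aims for.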