Digital sustainable publication of legacy parliamentary proceedings

We address the problem of publishing parliamentary proceedings in a digitally sustainable manner. We give an extensive requirements analysis and, based on it, propose a uniform XML format. We evaluated our approach by collecting and automatically processing proceedings from six parliaments, spanning almost 200 years in total. Most of this data is real legacy data consisting of scanned and OCRed documents. The approach scales very well and produces high-quality data. All documents are transformed into UTF-8 encoded XML files with extensive metadata in the Dublin Core standard. The text itself is divided into pages, which are in turn divided into paragraphs. Every document, page, and paragraph has a unique URN that resolves to a web page. Every page element in the XML files is linked to a facsimile image of that page in PDF or JPEG format. We created a viewer in which both versions (text and facsimile) can be inspected simultaneously. A search engine for the complete collection is available online.
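The document structure described above can be illustrated with a minimal sketch. It is not the format proposed in the paper: the element names (`proceedings`, `meta`, `page`, `p`), the `facsimile` attribute, and the URN scheme under the reserved `urn:example` namespace are all assumptions made for illustration. Only the general shape is taken from the abstract: Dublin Core metadata, pages containing paragraphs, a unique identifier on every level, a facsimile link on every page, and UTF-8 output.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace (the real DC 1.1 namespace URI).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_document(doc_urn, metadata, pages):
    """Build a document tree: metadata, then pages, each holding paragraphs.

    Element and attribute names are hypothetical; the URN scheme is an
    assumption, not the one used by the authors.
    """
    root = ET.Element("proceedings", {"id": doc_urn})

    # Metadata block with Dublin Core elements such as dc:title and dc:date.
    meta = ET.SubElement(root, "meta")
    for name, value in metadata.items():
        ET.SubElement(meta, f"{{{DC_NS}}}{name}").text = value

    # Each page gets its own URN and a pointer to its facsimile image.
    for page_no, (facsimile, paragraphs) in enumerate(pages, start=1):
        page_urn = f"{doc_urn}:p{page_no}"
        page = ET.SubElement(root, "page", {"id": page_urn, "facsimile": facsimile})
        # Each paragraph URN extends the URN of the page that contains it.
        for par_no, text in enumerate(paragraphs, start=1):
            par = ET.SubElement(page, "p", {"id": f"{page_urn}.{par_no}"})
            par.text = text
    return root


def serialize(root):
    """Serialize the tree as UTF-8 bytes with an XML declaration."""
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


example = build_document(
    "urn:example:proceedings:1950-01-17",
    {"title": "Example sitting", "date": "1950-01-17", "language": "nl"},
    [
        ("page-0001.jpg", ["First paragraph.", "Second paragraph."]),
        ("page-0002.jpg", ["Third paragraph."]),
    ],
)
xml_bytes = serialize(example)
```

Deriving page and paragraph identifiers from the document URN keeps them unique by construction and lets a resolver map any identifier back to its enclosing page and document, which is one plausible way to satisfy the requirement that every URN resolves to a web page.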
