Creating a Complete Workflow for Digitising Historical Census Documents: Considerations and Evaluation

The 1961 Census of England and Wales was the first UK census to make use of computers. However, only bound volumes and microfilm copies of printouts remain, locking a wealth of information in a form that is practically unusable for research. In this paper, we describe the process of creating the digitisation workflow that was developed as part of a pilot study for the Office for National Statistics. The emphasis of the paper is on the issues arising from the historical nature of the material and how they were resolved. The steps described include image pre-processing, OCR setup, table recognition, post-processing, data ingestion, crowdsourcing, and quality assurance. Evaluation methods and results are presented for all steps.
