Evaluation and refinement of an enhanced OCR process for mass digitisation

Great expectations are placed on the capacity of heritage institutions to make their collections available in digital format. Data-driven research is becoming a key concept within the humanities and social sciences. Kungliga biblioteket’s (National Library of Sweden, KB) collections of digitised newspapers can thus be regarded as unique cultural data sets containing information that is rarely conveyed in other media types. The digital format makes it possible to explore these resources in ways that are not feasible in printed form. As texts are no longer only read but also subjected to computer-based analysis, the demand for a correct rendering of the original text increases. OCR technologies for converting images to machine-readable text play a fundamental part in making these resources available, but their effectiveness varies with the type of document being processed. This is evident in the digitisation of newspapers, where factors relating to their production, layout and paper quality often impair the OCR output. To improve the machine-readable text, especially for digitised newspapers, KB initiated the development of an OCR module in which key parameters can be adjusted according to the characteristics of the material being processed. The purpose of this paper is to present the project goals and methods.
