Keep, Change or Delete? Setting up a Low Resource OCR Post-correction Framework for a Digitized Old Finnish Newspaper Collection

Interest in digitizing both handwritten and printed historical material has grown considerably over the last 10–15 years, and it will most probably keep increasing in the ongoing Digital Humanities era. As a result, many digital historical document collections are already available, and more will follow.
