Europeana Newspapers OCR Workflow Evaluation

This paper summarises the final performance evaluation of the OCR workflow used for large-scale production in the Europeana Newspapers project. It gives a detailed overview of how the software involved performed on a representative dataset of historical newspaper pages, for which ground truth was created, with regard to general text accuracy as well as layout-related factors that affect how the material can be used in specific use scenarios. Specific types of errors are examined in order to identify possible improvements to the document image analysis and recognition methods employed. Alternatives to the standard production workflow are also assessed to determine future directions and to give best-practice advice for OCR projects.