Flexible character accuracy measure for reading-order-independent evaluation

Abstract. The extraction of textual information from scanned document pages is a fundamental stage in any digitisation effort and directly determines the success of the document analysis and understanding applications built on it. To evaluate and improve the performance of optical character recognition (OCR), it is necessary to measure the accuracy of that step alone, without the influence of the processing steps that precede it (e.g. text block segmentation and ordering). Current OCR performance evaluation measures, which are based on edit distance, are strongly subjective because they must first serialise the entire text of a document. That serialisation depends heavily on the reading order determined by the processing steps prior to OCR, which is often wrong, especially for multi-column and complex layouts. This paper presents a new objective and practical edit-distance-based character recognition accuracy measure which overcomes those limitations. It achieves its independence from the reading order by comparing sub-strings of text in a flexible way (i.e. allowing for ordering variations). Because the flexible character accuracy measure isolates OCR errors, the other steps of a digitisation workflow can be evaluated and optimised separately, which enables effective tuning of the complete workflow. For the same reason, it also enables a better estimate of the (manual) post-OCR correction effort required. The proposed character accuracy measure has been systematically analysed and validated under lab conditions, and it has been used successfully in practice in a number of high-profile international competitions since 2017.
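The reading-order problem described above can be illustrated with a minimal sketch. It is not the measure proposed in the paper: the brute-force search over block orderings is only a stand-in for "allowing for ordering variations", and the function names (`levenshtein`, `char_accuracy`, `order_independent_accuracy`) and the sample text blocks are assumptions made for illustration.

```python
from itertools import permutations


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def char_accuracy(truth: str, result: str) -> float:
    """Edit-distance-based character accuracy on serialised text."""
    if not truth:
        return 1.0 if not result else 0.0
    return max(0.0, 1.0 - levenshtein(truth, result) / len(truth))


def order_independent_accuracy(truth_blocks, ocr_blocks) -> float:
    """Illustrative stand-in (NOT the paper's algorithm): take the best
    accuracy over every ordering of the OCR blocks. Only feasible for a
    handful of blocks, since the number of orderings grows factorially."""
    truth = "\n".join(truth_blocks)
    return max(char_accuracy(truth, "\n".join(order))
               for order in permutations(ocr_blocks))


# Hypothetical two-column page: every character is recognised correctly,
# but the columns are serialised in the wrong reading order.
truth_blocks = ["left column", "right column"]
ocr_blocks = ["right column", "left column"]

strict = char_accuracy("\n".join(truth_blocks), "\n".join(ocr_blocks))
flexible = order_independent_accuracy(truth_blocks, ocr_blocks)
print(f"serialised: {strict:.2f}  order-independent: {flexible:.2f}")
```

In this sketch the serialised score falls below 1.0 although no character was misrecognised, while the order-independent score stays at 1.0. That gap is the reading-order effect a flexible measure is meant to remove.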
