A comprehensive evaluation methodology for noisy historical document recognition techniques

In this paper, we propose a new comprehensive methodology in order to evaluate the performance of noisy historical document recognition techniques. We aim to evaluate not only the final noisy recognition result but also the main intermediate stages of text line, word and character segmentation. For this purpose, we efficiently create the text line, word and character segmentation ground truth guided by the transcription of the historical documents. The proposed methodology consists of (i) a semiautomatic procedure in order to detect the text line, word and character segmentation ground truth regions making use of the correct document transcription, (ii) calculation of proper evaluation metrics in order to measure the performance of the final OCR result as well as of the intermediate segmentation stages. The semi-automatic procedure for detecting the ground truth regions has been evaluated and proved efficient and time saving. Experimental results prove that using the proposed technique, the percentage of time saved for the text line, word and character segmentation ground truth creation is more than 90%. An analytic experiment using a commercial OCR engine applied to a historical book is also presented.

[1]  Ioannis Pratikakis,et al.  An old greek handwritten OCR system based on an efficient segmentation-free approach , 2007, International Journal of Document Analysis and Recognition (IJDAR).

[2]  Horst Bunke,et al.  Automatic segmentation of the IAM off-line database for handwritten English text , 2002, Object recognition supported by user interaction for service robots.

[3]  Chun-Jen Chen,et al.  A linear-time component-labeling algorithm using contour tracing technique , 2004, Comput. Vis. Image Underst..

[4]  Ioannis Pratikakis,et al.  Adaptive degraded document image binarization , 2006, Pattern Recognit..

[5]  Lasko Laskov Classification and Recognition of Neume Note Notation in Historical Documents , 2006 .

[6]  G. Louloudisa,et al.  Text line detection in handwritten documents , 2008 .

[7]  Ihsin T. Phillips,et al.  Empirical Performance Evaluation of Graphics Recognition Systems , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[8]  Ichiro Fujinaga,et al.  Document Recognition for a Million Books , 2006, D Lib Mag..

[9]  Basilios Gatos,et al.  Handwriting Segmentation Contest , 2007, ICDAR.

[10]  R. Manmatha,et al.  Holistic word recognition for handwritten historical documents , 2004, First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings..

[11]  B. Gatos,et al.  Automatic Borders Detection of Camera Document Images , 2007 .

[12]  Friedrich M. Wahl,et al.  Block segmentation and text extraction in mixed text/image documents , 1982, Comput. Graph. Image Process..

[13]  C. Halatsis,et al.  Line And Word Segmentation of Handwritten Documents , 2008 .

[14]  R. Manmatha,et al.  Word spotting for historical documents , 2006, International Journal of Document Analysis and Recognition (IJDAR).

[15]  James Allan,et al.  Text alignment with handwritten documents , 2004, First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings..

[16]  Ioannis Pratikakis,et al.  Text line and word segmentation of handwritten documents , 2009, Pattern Recognit..

[17]  Sergios Theodoridis,et al.  Keyword-guided word spotting in historical printed documents using synthetic data and user feedback , 2007, International Journal of Document Analysis and Recognition (IJDAR).

[18]  George Nagy,et al.  Performance metrics for document understanding systems , 1993, Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR '93).

[19]  Alan F. Smeaton,et al.  Word matching using single closed contours for indexing handwritten historical documents , 2006, International Journal of Document Analysis and Recognition (IJDAR).

[20]  Stavros J. Perantonis,et al.  A Complete Optical Character Recognition Methodology for Historical Documents , 2008, 2008 The Eighth IAPR International Workshop on Document Analysis Systems.

[21]  Sergios Theodoridis,et al.  Optical character recognition of the Orthodox Hellenic Byzantine Music notation , 2002, Pattern Recognit..