A Complete Optical Character Recognition Methodology for Historical Documents

This paper presents a complete OCR methodology for recognizing historical documents, either printed or handwritten, without any prior knowledge of the font. The methodology consists of three steps: the first two build a training database from a set of documents, and the third recognizes new document images. The first step is pre-processing, which includes image binarization and enhancement. The second step applies a top-down segmentation approach to detect text lines, words, and characters, after which a clustering scheme groups characters of similar shape. This procedure is semi-automatic, since the user can intervene at any time to correct clustering errors and assign ASCII labels. The resulting character database is then used for recognition. In the third step, every new document image is segmented with the same approach, and recognition is based on the character database produced in the previous step.
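The three steps can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not name the binarization, segmentation, or matching techniques, so the global threshold, projection-profile segmentation, and nearest-neighbor matching below are assumptions chosen for brevity, and all function names are hypothetical. Word detection is omitted, and the clustering with user correction is replaced by hand-assigned labels.

```python
from typing import List, Tuple

Image = List[List[int]]   # rows of pixel values
Span = Tuple[int, int]    # half-open (start, end) range


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Step 1 (assumed global threshold): 1 marks ink, 0 marks background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def runs(profile: List[int]) -> List[Span]:
    """Return the ranges where a projection profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(binary: Image) -> List[Span]:
    """Step 2a: text lines are runs of rows that contain ink."""
    return runs([sum(row) for row in binary])


def segment_chars(binary: Image, top: int, bottom: int) -> List[Span]:
    """Step 2b: characters are runs of columns that contain ink within a line."""
    width = len(binary[0])
    return runs([sum(row[x] for row in binary[top:bottom]) for x in range(width)])


def normalize(glyph: Image, size: int = 3) -> Image:
    """Resample a glyph to size x size so glyphs of any shape are comparable."""
    h, w = len(glyph), len(glyph[0])
    return [[glyph[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]


def extract_glyphs(binary: Image, top: int, bottom: int) -> List[Image]:
    """Crop and normalize every character found in one text line."""
    return [normalize([row[left:right] for row in binary[top:bottom]])
            for left, right in segment_chars(binary, top, bottom)]


def distance(a: Image, b: Image) -> int:
    """Count differing pixels between two normalized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph: Image, database: List[Tuple[Image, str]]) -> str:
    """Step 3: return the label of the closest glyph in the database."""
    return min(database, key=lambda entry: distance(glyph, entry[0]))[1]


# Toy page: the first line is used for training, the second for recognition.
page = [
    ".......",
    ".#.###.",
    ".#.#.#.",
    ".#.###.",
    ".......",
    ".###.#.",
    ".#.#.#.",
    ".###.#.",
    ".......",
]
gray = [[0 if c == "#" else 255 for c in row] for row in page]
binary = binarize(gray)
lines = segment_lines(binary)

# Labels are assigned by hand here, standing in for clustering plus user input.
database = list(zip(extract_glyphs(binary, *lines[0]), ["I", "O"]))
text = "".join(recognize(g, database) for g in extract_glyphs(binary, *lines[1]))
print(text)  # OI
```

Projection profiles only separate characters that do not touch or overlap, so degraded or cursive material would need a more robust segmentation than this sketch provides.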
