Historical Typewritten Document Recognition Using Minimal User Interaction

Recognition of low-quality historical typewritten documents remains a challenging task because of faint and degraded characters, stains, tears, punch holes and similar defects. In this paper, we exploit the unique characteristics of historical typewritten documents to propose an efficient recognition methodology that requires minimal user interaction. It consists of a pre-processing stage that enhances image quality and extracts connected components, a semi-supervised clustering stage that detects the most representative character samples, and a segmentation-free recognition stage based on template matching and cross-correlation. Experimental results show that, even with minimal user interaction, the proposed method achieves promising recognition accuracy.
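
The abstract does not describe how the matching stage is implemented, so the following is only a minimal sketch of the general idea of template matching by cross-correlation, not the authors' method. It assumes grayscale images stored as nested lists of numbers and uses the direct (slow) form of zero-mean normalized cross-correlation. The function names `ncc_score` and `best_match` are hypothetical and chosen for illustration.

```python
import math


def ncc_score(window, template):
    """Zero-mean normalized cross-correlation of two equal-sized 2D lists."""
    w = [v for row in window for v in row]
    t = [v for row in template for v in row]
    w_mean = sum(w) / len(w)
    t_mean = sum(t) / len(t)
    num = sum((a - w_mean) * (b - t_mean) for a, b in zip(w, t))
    den = math.sqrt(
        sum((a - w_mean) ** 2 for a in w) * sum((b - t_mean) ** 2 for b in t)
    )
    # A flat window (e.g. blank paper) has zero variance; score it as 0.
    return num / den if den > 0 else 0.0


def best_match(image, template):
    """Slide the template over the image; return ((row, col), score) of the best fit."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best_pos, best_score = None, -2.0  # NCC scores lie in [-1, 1]
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            window = [row[x:x + tw] for row in image[y:y + th]]
            score = ncc_score(window, template)
            if score > best_score:
                best_pos, best_score = (y, x), score
    return best_pos, best_score


# Toy example: a plus-shaped "character" placed on an otherwise blank page.
template = [[0, 1, 0],
            [1, 1, 1],
            [0, 1, 0]]
page = [[0] * 12 for _ in range(10)]
for dy in range(3):
    for dx in range(3):
        page[4 + dy][5 + dx] = template[dy][dx]

position, score = best_match(page, template)
print(position, round(score, 3))
```

Because the score is normalized by the local mean and variance, it is insensitive to uniform changes in brightness and contrast, which is one reason this family of measures suits faint or unevenly inked characters. In such a scheme, the templates would be the representative character samples selected in the clustering stage.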
