Image processing for historical newspaper archives

This paper presents image processing methods that improve the accuracy of character segmentation for historical newspaper archives. Full-text search based on word spotting is a promising way to make digital archives easier to use. Some word spotting techniques, however, require the target images to be segmented into character images in advance, and character segmentation is difficult for old and degraded document images. This paper identifies the factors that make character segmentation difficult and removes them in order to improve segmentation accuracy. We first detect the ruled lines using the Hough transform and use them to segment a whole newspaper image into column-separated images. We then remove the ruled lines, together with ruby characters and noise. The proposed system was tested on 20 column-separated images of historical newspapers, and the accuracy of character segmentation improved to 96.3%.
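
As an illustration of the ruled-line detection step, the following is a minimal sketch of a standard Hough transform in pure Python. The abstract does not describe the authors' implementation, so the function names (`hough_peaks`, `ruled_line_positions`), the vote threshold, the angular tolerance, and the synthetic test page are all assumptions made for this example. A real system would more likely use an optimized library routine on a binarized scan.

```python
import math


def hough_peaks(binary, min_votes, n_theta=180):
    """Vote in (rho, theta) space and return cells with at least min_votes.

    binary is a list of rows; truthy entries are foreground (ink) pixels.
    Each result is (rho, theta_degrees, votes), using the normal form
    rho = x*cos(theta) + y*sin(theta) with theta sampled over [0, 180).
    """
    step = 180.0 / n_theta
    trig = [(math.cos(math.radians(i * step)), math.sin(math.radians(i * step)))
            for i in range(n_theta)]
    votes = {}
    for y, row in enumerate(binary):
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            # Every foreground pixel votes once per sampled angle.
            for i, (c, s) in enumerate(trig):
                cell = (int(round(x * c + y * s)), i)
                votes[cell] = votes.get(cell, 0) + 1
    peaks = [(rho, i * step, n) for (rho, i), n in votes.items() if n >= min_votes]
    return sorted(peaks, key=lambda p: -p[2])


def ruled_line_positions(peaks, tol_deg=1.0):
    """Split peaks into approximate x positions of vertical lines and
    y positions of horizontal lines, ignoring other orientations."""
    xs, ys = set(), set()
    for rho, theta, _ in peaks:
        if theta <= tol_deg or theta >= 180.0 - tol_deg:
            xs.add(abs(rho))   # line normal is (nearly) the x axis
        elif abs(theta - 90.0) <= tol_deg:
            ys.add(rho)        # line normal is (nearly) the y axis
    return sorted(xs), sorted(ys)


# Hypothetical 20x30 "page": one vertical rule at x=12, one horizontal at y=5.
page = [[1 if (x == 12 or y == 5) else 0 for x in range(30)] for y in range(20)]
peaks = hough_peaks(page, min_votes=20)
vertical_x, horizontal_y = ruled_line_positions(peaks)
```

In this sketch, the x positions of the vertical rules could serve as candidate boundaries for cropping the page into column-separated images. Restricting the peaks to near-vertical and near-horizontal angles is a simplification that assumes the scan is not noticeably skewed.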
