A comparison of binarization methods for historical archive documents

This paper compares several binarization algorithms for historical archive documents by evaluating their effect on end-to-end word recognition performance in a complete archive document recognition system that uses a commercial OCR engine. The algorithms evaluated are: global thresholding; Niblack's and Sauvola's algorithms; adaptive versions of Niblack's and Sauvola's algorithms; and Niblack's and Sauvola's algorithms applied to background-removed images. We found that, for our archive documents, Niblack's algorithm can achieve better performance than Sauvola's (which has been presented as an evolution of Niblack's algorithm), and that it also achieved better performance than the internal binarization provided as part of the commercial OCR engine.
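For readers unfamiliar with the two local methods named above, the following is a minimal sketch of the commonly cited textbook forms of the Niblack threshold, T = m + k·s, and the Sauvola threshold, T = m·(1 + k·(s/R − 1)), where m and s are the mean and standard deviation of grey levels in a window around each pixel. The window size, the values k = -0.2 and k = 0.5, R = 128, and all function names are illustrative assumptions; the abstract does not state the parameters or implementation used in the paper, and the adaptive and background-removal variants are not shown.

```python
import math


def local_stats(img, r, c, half):
    """Mean and standard deviation of the window centred on (r, c).

    The window is clipped at the image borders (an assumption of this sketch).
    """
    rows, cols = len(img), len(img[0])
    vals = [img[i][j]
            for i in range(max(0, r - half), min(rows, r + half + 1))
            for j in range(max(0, c - half), min(cols, c + half + 1))]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, math.sqrt(var)


def niblack_threshold(mean, std, k=-0.2):
    """Niblack: T = m + k * s. The default k is a commonly quoted value, assumed here."""
    return mean + k * std


def sauvola_threshold(mean, std, k=0.5, R=128.0):
    """Sauvola: T = m * (1 + k * (s / R - 1)). k and R are assumed typical values."""
    return mean * (1.0 + k * (std / R - 1.0))


def binarize(img, threshold_fn, half=1):
    """Label each pixel 1 (dark ink) if it is below its local threshold, else 0."""
    out = []
    for r in range(len(img)):
        row = []
        for c in range(len(img[0])):
            mean, std = local_stats(img, r, c, half)
            row.append(1 if img[r][c] < threshold_fn(mean, std) else 0)
        out.append(row)
    return out


# Hypothetical 3x3 grey-level patch: one dark pixel on a light background.
patch = [[200, 200, 200],
         [200, 50, 200],
         [200, 200, 200]]
niblack_result = binarize(patch, niblack_threshold)
sauvola_result = binarize(patch, sauvola_threshold)
```

On this toy patch both thresholds mark only the centre pixel as ink. On real archive pages the two can differ noticeably in low-contrast background regions, where the standard deviation is small and Sauvola's threshold falls well below the local mean while Niblack's stays close to it.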
