Paper information - Evaluation of a user-assisted archive construction system for online natural history archives

The creation of structured digital libraries from paper-based archives is an area of growing demand in many scientific and cultural fields, and this demand is met neither by off-the-shelf OCR nor by commercial form-processing systems. This paper describes and evaluates a configurable archive construction system, which integrates document image pre-processing and analysis with text post-processing tools and a standard OCR package. The prototype system is currently being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera and Coleoptera to a searchable digital archive. Evaluation results are summarised for two datasets comprising over 5,000 cards selected from different parts of this database. They indicate that overall end-to-end word recognition rates of 70-90% are readily achievable for key data fields, subject to the availability of suitable electronic dictionaries.

Andy C. Downton | Jingyu He
