A search engine for historical manuscript images

Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to enable public access to them. These collections are only available in image formats and require expensive manual annotation work for access to them. Current handwriting recognizers have word error rates in excess of 50% and therefore cannot be used for such material. We describe two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabeled images of handwritten documents given a text query. We show experiments with a training set of 100 transcribed pages and a test set of 987 handwritten page images from the George Washington collection. Experiments show that the precision at 20 documents is about 0.4 to 0.5 depending on the model. To the best of our knowledge, this is the first automatic retrieval system for historical manuscripts using text queries, without manual transcription of the original corpus.

[1]  W. Bruce Croft,et al.  Cross-lingual relevance models , 2002, SIGIR '02.

[2]  R. Manmatha,et al.  Scale Space Technique for Word Segmentation in Handwritten Documents , 1999, Scale-Space.

[3]  R. Manmatha,et al.  Word image matching using dynamic time warping , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[4]  R. Manmatha,et al.  Word spotting: indexing handwritten manuscripts , 1997 .

[5]  David A. Forsyth,et al.  Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary , 2002, ECCV.

[6]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[7]  Jimmie D. Lawson,et al.  Presentation , 2000, MFCSIT.

[8]  R. Manmatha,et al.  Retrieving Historical Manuscripts using Shape , 2003 .

[9]  David A. Forsyth,et al.  Learning the semantics of words and pictures , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[10]  Horst Bunke,et al.  Using a Statistical Language Model to Improve the Performance of an HMM-Based Cursive Handwriting Recognition System , 2001, Int. J. Pattern Recognit. Artif. Intell..

[11]  W. Bruce Croft,et al.  A language modeling approach to information retrieval , 1998, SIGIR '98.

[12]  R. Manmatha,et al.  Scale Space Technique for Word Segmentation in Handwritten Manuscripts , 1999 .

[13]  R. Manmatha,et al.  Holistic word recognition for handwritten historical documents , 2004, First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings..

[14]  R. Manmatha,et al.  Automatic image annotation and retrieval using cross-media relevance models , 2003, SIGIR.