Indexing Historical Documents by Word Shape Signatures

This paper presents a word spotting approach for indexing archival document images. Indices are constructed from keyword images, and spotting is formulated as indexing by shape. The well-known shape context descriptor is used to compute word image signatures from the skeleton points. Codewords are then extracted by thresholding the shape contexts, which yields a simpler and more compact representation based on bit vectors. Document images are roughly segmented into words and a lookup table is constructed in which each word subimage is treated as a bin. Keyword images are spotted in documents by a voting strategy: the codewords of the keyword are used to index into the lookup table, and votes are cast for the corresponding bins. The approach is illustrated in a real application scenario with documents from a digital archive of the Spanish Civil War.
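The pipeline described in the abstract (shape contexts, thresholded bit-vector codewords, a lookup table with one bin per word, and voting) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 3 x 8 log-polar grid, the ring edges, the threshold of one point per cell, the normalisation by mean pairwise distance, and all function names are assumptions made for illustration, and skeleton points are taken as given rather than extracted from images.

```python
import math
from collections import defaultdict

N_R, N_THETA = 3, 8  # assumed log-polar grid: 3 rings x 8 angular sectors


def mean_pairwise_distance(points):
    """Mean distance over all point pairs, used for scale normalisation."""
    dists = [math.hypot(q[0] - p[0], q[1] - p[1])
             for k, p in enumerate(points) for q in points[k + 1:]]
    return sum(dists) / len(dists)


def codewords(points, threshold=1):
    """One integer codeword per point: its shape context histogram,
    thresholded cell by cell and packed into a bit vector."""
    scale = mean_pairwise_distance(points)
    words = []
    for i, (px, py) in enumerate(points):
        hist = [0] * (N_R * N_THETA)
        for j, (qx, qy) in enumerate(points):
            if i == j:
                continue
            r = math.hypot(qx - px, qy - py) / scale
            if r == 0:
                continue  # skip coincident points
            # Assumed ring edges at 0.5 and 1.0 (in units of the mean distance).
            ring = min(N_R - 1, max(0, math.floor(math.log2(r)) + 2))
            theta = math.atan2(qy - py, qx - px) % (2 * math.pi)
            sector = int(theta / (2 * math.pi) * N_THETA) % N_THETA
            hist[ring * N_THETA + sector] += 1
        # Threshold each cell and set the matching bit of the codeword.
        words.append(sum(1 << b for b, c in enumerate(hist) if c >= threshold))
    return words


def build_lookup_table(word_points):
    """Map each codeword to the set of word ids (bins) in which it occurs."""
    table = defaultdict(set)
    for word_id, pts in word_points.items():
        for cw in codewords(pts):
            table[cw].add(word_id)
    return table


def spot(keyword_points, table):
    """Each keyword codeword casts one vote for every bin listed under it."""
    votes = defaultdict(int)
    for cw in codewords(keyword_points):
        for word_id in table.get(cw, ()):
            votes[word_id] += 1
    return dict(votes)
```

In this sketch a query is answered by ranking bins by vote count, for example `max(votes, key=votes.get)`. Because the histograms use relative coordinates and normalised distances, a translated copy of a word produces the same codewords as the original; robustness to real handwriting variation would depend on choices the abstract does not specify.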
