Indexing Historical Documents by Word Shape Signatures

This paper presents a word spotting approach for indexing archival document images. Indices are built from keyword images, and spotting is formulated as indexing by shape. The well-known shape context descriptor is computed at the skeleton points of each word image to obtain its signature. Codewords are then extracted by thresholding the shape contexts, which yields a simpler and more compact representation based on bit vectors. Document images are roughly segmented into words and a lookup table is constructed in which each word subimage acts as a bin. Keyword images are spotted in documents by a voting strategy: the codewords of the keyword index into the lookup table and cast votes for the corresponding bins. The approach is illustrated in a real application scenario with documents from a digital archive of the Spanish Civil War.
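The pipeline summarized above (shape contexts at skeleton points, thresholded bit-vector codewords, a lookup table with one bin per word subimage, and voting) can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper. The function names, the number of radial and angular bins, the radius normalization, and the threshold value are all assumptions chosen for illustration, and skeleton points are assumed to be already available as (x, y) pairs.

```python
import math
from collections import Counter, defaultdict


def shape_context(points, i, n_r=3, n_theta=8):
    """Log-polar histogram of the other points as seen from points[i].

    n_r and n_theta are illustrative bin counts, not values from the paper.
    """
    px, py = points[i]
    offsets = [(x - px, y - py) for j, (x, y) in enumerate(points) if j != i]
    hist = [0] * (n_r * n_theta)
    dists = [math.hypot(dx, dy) for dx, dy in offsets]
    mean_d = sum(dists) / len(dists) if dists else 0.0
    if mean_d == 0.0:
        return hist  # isolated or fully duplicated point: empty histogram
    # Assumed radial bin edges, doubling from half the mean distance.
    edges = [0.5 * 2 ** k for k in range(n_r - 1)]
    for (dx, dy), d in zip(offsets, dists):
        r_bin = sum(d / mean_d >= e for e in edges)
        angle = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = min(n_theta - 1, int(angle / (2 * math.pi) * n_theta))
        hist[r_bin * n_theta + t_bin] += 1
    return hist


def codeword(hist, threshold=1):
    """Threshold a shape context into a bit vector, packed into an int."""
    bits = 0
    for k, count in enumerate(hist):
        if count >= threshold:
            bits |= 1 << k
    return bits


def word_codewords(points):
    """Set of codewords over all skeleton points of one word image."""
    return {codeword(shape_context(points, i)) for i in range(len(points))}


def build_lookup_table(words):
    """Map each codeword to the word subimages (bins) that contain it."""
    table = defaultdict(set)
    for word_id, points in words.items():
        for cw in word_codewords(points):
            table[cw].add(word_id)
    return table


def spot(keyword_points, table):
    """Each keyword codeword votes for every bin stored under it."""
    votes = Counter()
    for cw in word_codewords(keyword_points):
        for word_id in table.get(cw, ()):
            votes[word_id] += 1
    return votes


# Toy skeletons standing in for segmented word subimages (hypothetical data).
l_shape = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (2, 0)]
line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
table = build_lookup_table({"w_L": l_shape, "w_line": line})

# The keyword is a translated copy of the L-shaped skeleton.
query = [(x + 10, y + 5) for x, y in l_shape]
votes = spot(query, table)
```

In this sketch the word subimage with the most votes is the best candidate match for the keyword. Because the codewords are built from relative offsets, the translated query produces the same codewords as the indexed L-shaped skeleton. A practical system would additionally need a vote threshold and some tolerance to noise in the skeletons, neither of which is modeled here.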
