Word Retrieval in Historical Document Using Character-Primitives

Word searching and indexing in historical document collections is a challenging problem because, characters in these documents are often touching or broken due to degradation/ ageing effects. For efficient searching in such historical documents, this paper presents a novel approach towards word spotting using string matching of character primitives. We describe the text string as a sequence of primitives which consists of a single character or a part of a character. Primitive segmentation is performed analyzing text background information that is obtained by water reservoir technique. Next, the primitives are clustered using template matching and a codebook of representative primitives is built. Using this primitive codebook, the text information in the document images are encoded and stored. For a query word, we segment it into primitives and encode the word by a string of representative primitives from codebook. Finally, an approximate string matching is applied to find similar words. The matching similarity is used to rank the retrieved words. The proposed method is tested on historical books of French alphabets and we have obtained encouraging results from the experiment.

[1]  Giovanni Soda,et al.  Indexing and retrieval of words in old documents , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[2]  Jean-Yves Ramel,et al.  User-driven page layout analysis of historical printed books , 2007, International Journal of Document Analysis and Recognition (IJDAR).

[3]  Ioannis Pratikakis,et al.  Segmentation-free Word Spotting in Historical Printed Documents , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[4]  Mohamed Cheriet,et al.  Application of Multi-Level Classifiers and Clustering for Automatic Word Spotting in Historical Document Images , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[5]  R. Manmatha,et al.  Word image matching using dynamic time warping , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[6]  Frank Lebourgeois,et al.  Towards an omnilingual word retrieval system for ancient manuscripts , 2009, Pattern Recognit..

[7]  Shijian Lu,et al.  Document Image Retrieval through Word Shape Coding , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Umapada Pal,et al.  Touching numeral segmentation using water reservoir concept , 2003, Pattern Recognit. Lett..

[9]  Patrick A. V. Hall,et al.  Approximate String Matching , 1994, Encyclopedia of Algorithms.