An Efficient Coarse-to-Fine Indexing Technique for Fast Text Retrieval in Historical Documents

In this paper, we present a fast text retrieval system for indexing and browsing degraded historical documents. The indexing and retrieval strategy follows a two-level, coarse-to-fine approach to speed up the retrieval process. During the indexing step, the text parts of the images are encoded into sequences of primitives drawn from two different codebooks: a coarse one corresponding to connected components and a fine one corresponding to glyph primitives. A glyph consists of a single character or a part of a character, depending on the shape complexity. During the querying step, the coarse and fine signatures are generated from the query image using both codebooks. A bi-level approximate string matching algorithm is then applied to find similar words, using the coarse approach first and the fine approach only if necessary, by exploiting predetermined hypothetical locations. An experimental evaluation on datasets of real-life document images, gathered from historical books in different scripts, demonstrated the speed improvement and good accuracy in the presence of degradation.
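
The bi-level matching idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each indexed word is stored as a pair of integer sequences (coarse and fine codebook IDs), it uses plain Levenshtein distance as the approximate string matching measure, and the function names and thresholds (`coarse_accept`, `coarse_reject`, `fine_max`) are hypothetical values chosen only for illustration. Words whose coarse distance is ambiguous play the role of the hypothetical locations that are re-examined with the fine signature.

```python
from typing import List, Sequence, Tuple

Signature = Sequence[int]  # sequence of codebook IDs (assumed encoding)


def edit_distance(a: Signature, b: Signature) -> int:
    """Levenshtein distance between two primitive-ID sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def bilevel_search(query_coarse: Signature,
                   query_fine: Signature,
                   index: List[Tuple[Signature, Signature]],
                   coarse_accept: int = 0,   # hypothetical threshold
                   coarse_reject: int = 2,   # hypothetical threshold
                   fine_max: int = 1         # hypothetical threshold
                   ) -> List[Tuple[int, str]]:
    """Return (word_id, level) pairs, where level names the deciding stage."""
    hits = []
    for word_id, (coarse, fine) in enumerate(index):
        d_coarse = edit_distance(query_coarse, coarse)
        if d_coarse <= coarse_accept:
            # Coarse signature alone is conclusive: accept without fine matching.
            hits.append((word_id, "coarse"))
        elif d_coarse <= coarse_reject:
            # Ambiguous candidate: fall back to the slower fine signature.
            if edit_distance(query_fine, fine) <= fine_max:
                hits.append((word_id, "fine"))
        # Otherwise the word is rejected at the coarse level.
    return hits


# Toy index of four words: (coarse signature, fine signature).
index = [
    ([1, 2, 3], [10, 11, 12, 13]),
    ([1, 2, 4], [10, 11, 12, 14]),
    ([1, 5, 6], [20, 21, 22, 23]),
    ([7, 8, 9], [30, 31, 32, 33]),
]
hits = bilevel_search([1, 2, 3], [10, 11, 12, 13], index)
print(hits)
```

In this toy run, the fine comparison is computed only for the two words whose coarse distance falls between the accept and reject thresholds, which is where the speed gain of a coarse-to-fine scheme would come from.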
