A Holistic Methodology for Keyword Search in Historical Typewritten Documents

In this paper, we propose a novel holistic methodology for keyword search in historical typewritten documents that combines synthetic data with user feedback. The holistic approach treats each word as a single entity and recognizes the whole word rather than its individual characters. Our aim is to search for keywords typed by the user in a large collection of digitized historical typewritten documents. The proposed method is based on: (i) creation of synthetic word images; (ii) word segmentation using dynamic parameters; (iii) efficient hybrid feature extraction for each word image; and (iv) a retrieval procedure that is optimized by user feedback. Experimental results demonstrate the efficiency of the proposed approach.
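
To make the retrieval step (iv) more concrete, the following is a minimal sketch of feature-based ranking with a feedback loop. It is not the method of the paper: the abstract does not specify the hybrid features or the feedback mechanism, so the sketch assumes that each word image is already described by a numeric feature vector, uses Euclidean distance for matching, and refines the query by averaging it with the words the user marks as correct (a simple Rocchio-style update). All names (`rank_words`, `refine_query`, `word_features`) and all values are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_words(query, word_features):
    """Return word ids ordered from most to least similar to the query."""
    return sorted(word_features, key=lambda wid: distance(query, word_features[wid]))


def refine_query(query, word_features, relevant_ids):
    """Average the query with the vectors of words the user marked as correct.

    Assumption: a plain mean is used here only for illustration.
    """
    vectors = [query] + [word_features[wid] for wid in relevant_ids]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical features: the query stands for a synthetic word image,
# the dictionary for word images segmented from the scanned pages.
synthetic_query = [0.0, 0.0]
word_features = {
    "w1": [2.0, 2.0],
    "w2": [3.0, 0.0],
    "w3": [1.0, 3.0],
}

initial_ranking = rank_words(synthetic_query, word_features)

# The user confirms that "w1" is a correct match; the query moves toward it.
refined_query = refine_query(synthetic_query, word_features, ["w1"])
refined_ranking = rank_words(refined_query, word_features)
```

In this toy setting the confirmed match pulls the query toward real document data, so `w3` overtakes `w2` in the refined ranking. This illustrates why feedback can help when a synthetic query differs systematically from the degraded word images in the collection.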
