Keyword Searching for Arabic Handwritten Documents

In this paper we present a system for searching for keywords in Arabic handwritten and historical documents using two algorithms: Dynamic Time Warping (DTW) and Hidden Markov Models (HMM). The HMM-based system provides satisfactory results when adequate training samples are available, which is not always the case for historical documents. The DTW algorithm, with a slight modification, provides better results even with a small set of training samples. The observation sequences for the matching algorithms are generated by extracting a set of geometric features that have already been shown to achieve good recognition rates for on-line Arabic handwriting. We adopt a segmentation-free approach, i.e., continuous word-parts, rather than individual letters, serve as the basic alphabet. The contours of the complete word-parts represent the shapes being compared. Additional strokes, such as dots and detached short segments, which are very common in Arabic script, are used by a rule-based system to refine the search and determine the final comparison decision. A keyword is searched for by looking for its word-parts, including their additional strokes, in the right order. The results for our modified DTW algorithm are very encouraging, even when only a small set of samples is used for training.
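To make the matching step concrete, the following is a minimal sketch of the textbook DTW distance between two sequences of feature vectors, followed by a nearest-template lookup. It is not the modified DTW evaluated in this paper, whose modification is not described in the abstract. The function names `dtw_distance` and `best_match`, the Euclidean local cost, and the toy two-dimensional features are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Standard DTW distance between two non-empty sequences of
    equal-dimension feature vectors (Euclidean local cost, assumed)."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def best_match(query, templates):
    """Return (label, distance) for the labelled template closest to query.
    `templates` is a hypothetical list of (label, sequence) pairs."""
    scored = ((label, dtw_distance(query, seq)) for label, seq in templates)
    return min(scored, key=lambda pair: pair[1])


# Toy usage with made-up 2-D features standing in for word-part contours.
templates = [
    ("part_a", [(0.0, 0.0), (1.0, 1.0)]),
    ("part_b", [(5.0, 5.0), (6.0, 6.0)]),
]
label, dist = best_match([(0.0, 0.0), (0.1, 0.1), (1.0, 1.0)], templates)
```

Because DTW lets one sequence stretch against the other, a query with an extra sample along the same contour still aligns closely with its template, which is one reason template matching of this kind can work with few training samples.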
