Transcript mapping for handwritten Arabic documents

Handwriting recognition research requires large databases of word images, each labeled with the word it contains. Scanned page images, however, usually contain whole sentences or paragraphs of writing. Creating labeled databases of isolated word images is therefore tedious: a person must drag a rectangle around each word in the full image and type in its label. Transcript mapping is the automatic alignment of the words in a text file with the word locations in the full image, and it can ease the creation of such research databases. We propose the first transcript mapping method for handwritten Arabic documents. Our approach is based on Dynamic Time Warping (DTW) and offers two primary algorithmic contributions. The first is an extension to DTW that uses true distances when mapping multiple entries from one series to a single entry in the second series. The second is a method to concurrently map elements of a partially aligned third series within the main alignment. Preliminary results are provided.
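For readers unfamiliar with DTW, the following is a minimal sketch of the standard textbook algorithm only. It does not implement the two extensions described above (the true-distance many-to-one mapping or the concurrent third-series alignment). The function name `dtw_align` and the idea of comparing scalar features (for example, a transcript word's character count against a word image's normalized width) are assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def dtw_align(
    a: Sequence[float],
    b: Sequence[float],
    dist: Callable[[float, float], float],
) -> Tuple[float, List[Tuple[int, int]]]:
    """Baseline DTW: return (total cost, warping path) aligning a to b.

    The path is a list of (index_in_a, index_in_b) pairs. One element of a
    may be paired with several elements of b, and vice versa.
    """
    n, m = len(a), len(b)
    inf = float("inf")

    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # one-to-one step
                cost[i - 1][j],      # b[j-1] is reused for another a entry
                cost[i][j - 1],      # a[i-1] is reused for another b entry
            )

    # Backtrack from the end to recover the alignment path.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

In a transcript mapping setting, `a` might hold one feature per transcript word and `b` one feature per segmented image component, so that a path pairing one word with several components would indicate an over-segmented word. Note that this baseline simply sums the pairwise distances in such many-to-one cases, which is the behavior the first contribution above is described as replacing.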
