Word segmentation of handwritten dates in historical documents by combining semantic a-priori-knowledge with local features

Recognizing script in historical documents requires techniques that can reliably isolate single words. Segmenting lines and words is challenging because lines are not straight and words may intersect both within and between lines. For correct word segmentation, the conventional analysis of distances between text objects needs to be supplemented by a second component that predicts possible word boundaries from semantic information. For date entries, hypotheses about potential boundaries are generated from knowledge of the different ways in which dates are written in the documents. This knowledge is modeled by distribution curves for potential boundary locations. Word boundaries are then detected by classifying local features, such as distances between adjacent text objects, together with the location-based boundary distribution curves as a priori knowledge. We applied the technique to date entries in historical church registers. Documents from the 18th and 19th centuries were used for training and testing. The data set consisted of 674 word boundaries in 298 date entries. In 97% of all cases in the test data set, the correct segmentation of a word sequence was among the four best hypotheses produced by our algorithm.
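The combination of local gap features with a location-based prior can be illustrated with a minimal sketch. It is not the classifier used in the paper: the Gaussian-shaped prior, the exponential gap score, the multiplication of the two, and all names and parameter values (`boundary_prior`, `gap_score`, `rank_hypotheses`, `peaks`, `std`, `floor`, `scale`) are assumptions chosen only for illustration. Each gap between adjacent text objects is described by its relative position in the date entry and its width, and the candidate segmentations are ranked so that the four best hypotheses can be inspected.

```python
import math
from itertools import combinations


def boundary_prior(pos, peaks, std=0.08, floor=0.05):
    """Hypothetical a priori curve: a bump around each relative position
    (0..1) where a word boundary is expected for a given date format."""
    bumps = (math.exp(-0.5 * ((pos - p) / std) ** 2) for p in peaks)
    return max(floor, max(bumps))


def gap_score(width, scale=10.0):
    """Hypothetical local feature score in (0, 1): wider gaps between
    adjacent text objects are more likely to be word boundaries."""
    return 1.0 - math.exp(-width / scale)


def rank_hypotheses(gaps, peaks, n_boundaries, top_k=4):
    """Rank all ways of choosing n_boundaries gaps as word boundaries.

    gaps: list of (relative_position, width_in_pixels) tuples.
    Returns the top_k (score, boundary_indices) pairs, best first.
    """
    # Combined per-gap score: local evidence weighted by the prior.
    scores = [gap_score(w) * boundary_prior(p, peaks) for p, w in gaps]
    hypotheses = []
    for combo in combinations(range(len(gaps)), n_boundaries):
        total = 1.0
        for i, s in enumerate(scores):
            # Selected gaps contribute s, rejected gaps contribute 1 - s.
            total *= s if i in combo else (1.0 - s)
        hypotheses.append((total, combo))
    hypotheses.sort(key=lambda h: h[0], reverse=True)
    return hypotheses[:top_k]


# Made-up example: five gaps in a date entry that is assumed to contain
# three words, with boundaries expected near 25% and 60% of its width.
gaps = [(0.10, 3), (0.25, 14), (0.40, 4), (0.62, 12), (0.80, 5)]
top = rank_hypotheses(gaps, peaks=[0.25, 0.60], n_boundaries=2)
for score, boundaries in top:
    print(boundaries, round(score, 4))
```

Returning a short ranked list rather than a single answer mirrors the evaluation described above, in which a segmentation counts as found when it is among the four best hypotheses. The exhaustive enumeration is only practical because a date entry contains few gaps.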
