Recognition strategies for general handwritten text documents

This paper presents document recognition strategies for an important application: recognition of text documents containing multiple lines of text. A project to study the feasibility of recognizing essays written by middle school students is the focus of the second study. In this project, a scanned document is processed to extract individual lines of text from the essay, extract individual words from each line, and then apply word recognition techniques to the extracted words. While individual lines are extracted accurately using the gaps between lines, extracting words is a much bigger challenge. Because the essays are written by middle school children, word boundaries are often ambiguous, especially when words are written in a non-cursive, discrete style. In these cases the gaps between words are sometimes smaller than the gaps between characters within a word, causing errors in estimating the location of word boundaries. In this paper, we propose two classes of word boundaries: 1) strong boundaries, due to large gaps between words, and 2) weak boundaries, due to small gaps between words. There are also cases in which two words have no clear gap between them and are instead joined, giving the appearance of a single word. Results obtained from our Phase 1 study will be presented in the paper.
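
As an illustration of the strong/weak boundary idea, the following minimal sketch labels the gaps found in one extracted text line. It is not the method evaluated in the paper: the column ink profile, the function names find_gaps and classify_gaps, and the two pixel thresholds are all assumptions introduced here for illustration.

```python
from typing import List, Tuple

def find_gaps(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, width) for each run of empty columns between ink columns.

    `profile` is a hypothetical vertical projection of one text line:
    profile[x] is the number of ink pixels in column x.
    Leading and trailing blank margins are ignored.
    """
    gaps = []
    start = None
    seen_ink = False
    for x, ink in enumerate(profile):
        if ink == 0:
            # Open a gap only after the first ink column has been seen.
            if seen_ink and start is None:
                start = x
        else:
            if start is not None:
                gaps.append((start, x - start))
                start = None
            seen_ink = True
    return gaps

def classify_gaps(
    gaps: List[Tuple[int, int]], strong_min: int, weak_min: int
) -> List[Tuple[int, int, str]]:
    """Label each gap by width. Both thresholds are illustrative assumptions."""
    labeled = []
    for start, width in gaps:
        if width >= strong_min:
            label = "strong"      # large gap: confident word boundary
        elif width >= weak_min:
            label = "weak"        # small gap: candidate word boundary
        else:
            label = "intra-word"  # treated as a gap between characters
        labeled.append((start, width, label))
    return labeled

# Toy profile with gaps of width 1, 5, and 3 columns.
profile = [0, 3, 2, 0, 4, 0, 0, 0, 0, 0, 5, 1, 0, 0, 0, 2, 0]
labeled = classify_gaps(find_gaps(profile), strong_min=5, weak_min=3)
```

Fixed thresholds are the simplest possible choice, and a weak boundary would presumably be confirmed or rejected later, for example by the word recognizer. Joined words produce no empty columns at all, so a gap-based sketch like this one cannot detect them.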