This paper presents document recognition strategies for an important application: the recognition of text documents containing multiple lines of text. The focus of this study is a project to assess the feasibility of recognizing essays written by middle school students. In this project, a scanned document is processed to extract the individual lines of text from the essay, extract the individual words from each line, and then apply word recognition techniques to the extracted words. While individual lines are extracted accurately using the gap information between lines, extracting words is a much bigger challenge. Because the essays are written by middle school children, word boundaries are often ambiguous, especially when words are written in a non-cursive, discrete style. In these cases the gaps between words are sometimes smaller than the gaps between the characters of a word, which causes errors in estimating the location of word boundaries. In this paper, we propose two classes of word boundaries: 1) strong boundaries, due to large gaps between words, and 2) weak boundaries, due to small gaps between words. There are also cases in which two words have no clear gap between them and are instead joined, giving the appearance of a single word. Results obtained from our Phase 1 study are presented in the paper.
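The gap-based segmentation described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binarized page stored as rows of 0/1 pixels, uses simple projection profiles to find gaps, and the function names (`find_runs`, `extract_lines`, `classify_gaps`) and the thresholds `strong_min` and `weak_min` are hypothetical choices made for illustration only.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed input: binary image, 1 = ink, 0 = background


def find_runs(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, where profile >= min_ink."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i                      # an ink run begins
        elif value < min_ink and start is not None:
            runs.append((start, i))        # an ink run ends at a gap
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def extract_lines(image: Image) -> List[Tuple[int, int]]:
    """Split a page into text lines using blank rows as the gaps between lines."""
    row_profile = [sum(row) for row in image]
    return find_runs(row_profile)


def classify_gaps(line_rows: Image, strong_min: int, weak_min: int):
    """Label each horizontal gap in one text line.

    Hypothetical rule: width >= strong_min is a strong boundary,
    width >= weak_min is a weak boundary, anything narrower is
    treated as a gap between characters of the same word.
    """
    col_profile = [sum(col) for col in zip(*line_rows)]
    runs = find_runs(col_profile)
    gaps = []
    for (_, end_a), (start_b, _) in zip(runs, runs[1:]):
        width = start_b - end_a
        if width >= strong_min:
            label = "strong"
        elif width >= weak_min:
            label = "weak"
        else:
            label = "intra-word"
        gaps.append((end_a, start_b, label))
    return gaps


# Tiny synthetic page: two text lines separated by one blank row.
page = [
    [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1],
    [0] * 12,
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]
lines = extract_lines(page)
first_line_gaps = classify_gaps(page[0:1], strong_min=4, weak_min=2)
```

Fixed width thresholds are the weakest part of this sketch. As the abstract notes, gaps between words can be smaller than gaps between characters, so a weak boundary would in practice need to be confirmed or rejected by a later stage such as word recognition. Joined words leave no gap at all and cannot be found by gap analysis alone.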