We describe a method for rapid prototyping of contextual analysis algorithms within an experimental page reader. Because such algorithms vary widely and depend on details of the page reader’s internal data structures, each new application today requires custom low-level programming. This is undesirable: it impedes experimentation and restricts cost-effective applications to high-volume problems. To make contextual analysis more easily retargetable, we have designed a high-level language with primitives for traversing the document hierarchy and for generating, scoring, sorting, and pruning interpretations. It is based on Ousterhout’s interpreted language Tcl, which provides constructs such as variables, conditionals, and loops, and which is easily extended by adding functions. Some functions are table-driven or built in: for example, character typing, typographical morphology analysis, and regular expression matching. Other functions are normally implemented as separately executing UNIX processes that communicate with the page reader via pipes. These may be pre-existing software tools imported from other research fields such as computational linguistics, information retrieval, and string matching. We illustrate the expressive power of the language in three applications: English text using a spell-checker, Japanese text using character n-grams, and mixed Russian-English text using two lexicons with automatic context-switching.
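To make the generate, score, sort, and prune cycle concrete, the following is a minimal sketch in Python rather than in the Tcl-based language itself. All names here (`generate`, `score`, `interpret`, the `bonus` weight, and the toy lexicon) are hypothetical illustrations and are not the primitives of the language described above. The sketch assumes each character position carries a short list of candidate characters with classifier confidences, and that a word found in a lexicon should be favored.

```python
from itertools import product


def generate(char_candidates):
    """Yield (text, confidence) for every combination of per-position candidates."""
    for combo in product(*char_candidates):
        text = "".join(ch for ch, _ in combo)
        conf = 1.0
        for _, p in combo:
            conf *= p  # treat positions as independent (an assumption)
        yield text, conf


def score(text, conf, lexicon, bonus=2.0):
    """Boost interpretations that appear in the lexicon (hypothetical weighting)."""
    return conf * (bonus if text.lower() in lexicon else 1.0)


def interpret(char_candidates, lexicon, keep=3):
    """Generate, score, sort (best first), and prune to the top `keep` readings."""
    scored = [(score(t, c, lexicon), t) for t, c in generate(char_candidates)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:keep]


# Toy input: three character positions, two candidates each.
candidates = [
    [("c", 0.6), ("e", 0.4)],
    [("a", 0.8), ("o", 0.2)],
    [("t", 0.5), ("l", 0.5)],
]
lexicon = {"cat", "eat", "cot"}  # stand-in for a spell-checker's word list
best = interpret(candidates, lexicon)
print([text for _, text in best])
```

In the system described above, the lexicon lookup would plausibly be delegated to an external tool such as a spell-checker reached through a pipe; here a plain set stands in for it so that the sketch stays self-contained.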