Stochastic attributed K-d tree modeling of technical paper title pages

Structural information about a document is essential for structured query processing, indexing, and retrieval. A document page can be partitioned into a hierarchy of homogeneous regions such as columns and paragraphs; these regions are called physical components, and they define the physical layout of the page. In this paper we develop a class of models for the physical layouts of technical paper title pages. We model physical layout using hidden semi-Markov models for the directional projections of page regions, and a stochastic attributed K-d tree grammar for the 2D hierarchical structure of these regions. We use the models to generate sets of synthetic title page images in three distinctive styles, which we then use in controlled experiments on page structure analysis.
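To make the idea of a stochastic K-d tree grammar concrete, the following is a minimal sketch, not the model developed in the paper. It assumes a tiny hypothetical rule set (`RULES`), hypothetical region labels, and a Gaussian jitter on the cut position; none of these come from the paper, and the hidden semi-Markov models for the projections are omitted entirely. Each rule splits a rectangle in two along one axis, and a rule is chosen at random according to its probability.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Region:
    """A rectangular page region; the attributes are its label and bounds."""
    label: str
    x0: int
    y0: int
    x1: int
    y1: int
    children: list = field(default_factory=list)


# Hypothetical grammar, for illustration only.
# label -> list of (probability, cut axis, mean cut fraction, first, second).
# A rule whose axis is None stops splitting and leaves the region as a leaf.
RULES = {
    "page":   [(1.0, "y", 0.25, "header", "body")],
    "header": [(1.0, "y", 0.60, "title", "authors")],
    "body":   [(0.7, "x", 0.50, "column", "column"),
               (0.3, None, 0.0, "", "")],
}


def expand(region, rng, jitter=0.03):
    """Recursively apply randomly chosen binary split rules to a region."""
    rules = RULES.get(region.label)
    if rules is None:          # no rule for this label: terminal region
        return region
    weights = [rule[0] for rule in rules]
    _, axis, mean_cut, first, second = rng.choices(rules, weights=weights, k=1)[0]
    if axis is None:           # the "stop" rule was drawn
        return region
    # Assumed noise model: Gaussian jitter around the mean cut, clamped
    # so that both children keep a nonzero extent.
    frac = min(0.9, max(0.1, rng.gauss(mean_cut, jitter)))
    if axis == "y":            # horizontal cut: children stacked top to bottom
        cut = region.y0 + round(frac * (region.y1 - region.y0))
        a = Region(first, region.x0, region.y0, region.x1, cut)
        b = Region(second, region.x0, cut, region.x1, region.y1)
    else:                      # vertical cut: children placed side by side
        cut = region.x0 + round(frac * (region.x1 - region.x0))
        a = Region(first, region.x0, region.y0, cut, region.y1)
        b = Region(second, cut, region.y0, region.x1, region.y1)
    region.children = [expand(a, rng, jitter), expand(b, rng, jitter)]
    return region


def leaves(region):
    """Return the leaf regions of the tree in left-to-right order."""
    if not region.children:
        return [region]
    return [leaf for child in region.children for leaf in leaves(child)]


def sample_page(seed, width=850, height=1100):
    """Sample one synthetic layout tree for a page of the given size."""
    return expand(Region("page", 0, 0, width, height), random.Random(seed))
```

Calling `leaves(sample_page(seed=0))` yields the terminal regions of one sampled layout; different seeds give different cut positions and, for the body, either one or two columns. A fuller model would attach further attributes to each node and fill the leaf regions with content.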
