This paper presents an improved zone content classification method. Motivated by our background-analysis-based table identification research, we added two new features to the feature vector of a previously published method [7]: the total area of large horizontal and large vertical blank blocks, and the number of text glyphs in the zone. A binary decision tree assigns a zone class on the basis of the zone's feature vector. The training and testing data sets for the algorithm include images drawn from the UWCDROM III document image database. The classifier assigns each given scientific and technical document zone to one of nine classes: two text classes distinguished by font size, math, table, halftone, map/drawing, ruling, logo, and others. The improved zone classification method raised the accuracy rate and reduced the median false alarm rate.

1 Problem Statement

Let $\mathcal{A}$ be a set of zone entities. Let $\mathcal{L}$ be a set of content labels, such as text, table, math, etc. The function $f: \mathcal{A} \to \mathcal{L}$ associates each element of $\mathcal{A}$ with a label. The function $V: \mathcal{A} \to \Lambda$ specifies measurements made on each element of $\mathcal{A}$, where $\Lambda$ is the measurement space.

The zone content classification problem can be formulated as follows: given a zone set $\mathcal{A}$ and a content label set $\mathcal{L}$, find a classification function $f: \mathcal{A} \to \mathcal{L}$ that has the maximum probability

$$P(f(\mathcal{A}) \mid V(\mathcal{A})) \tag{1}$$

In our current approach, we assume conditional independence between the zone classifications, so the probability in Equation 1 may be decomposed as

$$P(f(\mathcal{A}) \mid V(\mathcal{A})) = \prod_{\tau \in \mathcal{A}} P(f(\tau) \mid V(\tau)) \tag{2}$$

The problem can therefore be solved by maximizing each individual probability $P(f(\tau) \mid V(\tau))$ in Equation 2, where $\tau \in \mathcal{A}$.

In our zone content classification experiment, the elements of the set $\mathcal{A}$ are zone groundtruth entities from the UWCDROM III document image database [6].
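The step from Equation 1 to per-zone maximization can be checked on a toy case. The following is a minimal sketch, assuming the per-zone posteriors are already available as lookup tables; the zone names, the three labels, and every probability value are invented for illustration and do not come from the experiments. It compares a brute-force search over all joint labellings, scored by the product in Equation 2, with an independent choice of the best label for each zone.

```python
from itertools import product
from math import prod

# Hypothetical per-zone posteriors P(f(tau) = label | V(tau)).
# Zone ids, labels, and numbers are invented purely for illustration.
posteriors = {
    "zone_a": {"text": 0.7, "table": 0.2, "math": 0.1},
    "zone_b": {"text": 0.1, "table": 0.6, "math": 0.3},
    "zone_c": {"text": 0.3, "table": 0.3, "math": 0.4},
}


def per_zone_argmax(posteriors):
    """Pick the most probable label for each zone independently."""
    return {zone: max(dist, key=dist.get) for zone, dist in posteriors.items()}


def joint_argmax(posteriors):
    """Brute-force search over every joint labelling, scoring each one by the
    product of per-zone posteriors (the right-hand side of Equation 2)."""
    zones = list(posteriors)
    label_sets = [list(posteriors[z]) for z in zones]
    best = max(
        product(*label_sets),
        key=lambda labels: prod(posteriors[z][l] for z, l in zip(zones, labels)),
    )
    return dict(zip(zones, best))


assignment = per_zone_argmax(posteriors)
```

Because every factor in the product is non-negative and depends on one zone only, the two functions agree on this example, while the per-zone version avoids enumerating all joint labellings.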
The elements of the label set $\mathcal{L}$ are two text classes distinguished by font size, math, table, halftone, map/drawing, ruling, logo, and others. $V(\tau)$ is the feature vector generated for zone $\tau$, where $\tau \in \mathcal{A}$. We used a decision tree classifier to compute the probability in Equation 2 and make the assignment.

2 Related Work and Paper Organization

A complete document image understanding system can transform paper documents into a hierarchical representation of their structure and content. The transformed document representation enables document interchange, editing, browsing, indexing, filing and retrieval. Zone classification plays a key role in the success of such a document understanding system. Not only is it useful for subsequent applications such as OCR and table understanding, but it can also be used to assist and validate document segmentation.

In the literature, Sivaramakrishnan et al. [7] extracted features for each zone such as run length mean and variance, spatial mean and variance, the fraction of the total number of black pixels in the zone, and the zone width ratio. They used a decision tree classifier to assign a zone class on the basis of its feature vector, and ran their experiments on document images from the UWCDROM I image database. Liang et al. [5] developed a feature-based zone classifier using only the widths and heights of the connected components within a given zone. Le et al. [4] proposed automated labeling of zones from scanned images with labels such as titles, authors, affiliations and abstracts. The labeling is based on features calculated from optical character recognition (OCR) output, neural network models, machine learning methods, and a set of rules derived from an analysis of the page layout for each journal and from generic typesetting knowledge for English text.

We developed a novel background-analysis-based table identification technique. We repeated the work of Sivaramakrishnan et al. [7] on a larger database, with the goal of improving its performance on table zone classification. Although some background analysis techniques can be found in the literature ([1], [2]), none of them, to our knowledge, has been applied to the table identification problem. We added two new features to the original feature vector: the total area of large horizontal and large vertical blank blocks, and the number of text glyphs in the given zone. We improved the accuracy rate and reduced the false alarm rates for most of the nine classes.

The rest of this paper is divided into five sections. Section 3 gives the definitions of large horizontal and large vertical blank blocks. The two new features are described in Section 4. A brief introduction to the decision tree classifier is given in Section 5. The experimental results are reported in Section 6. Our conclusion and statement of future work are discussed in Section 7.
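As a rough illustration of how the two added features extend a zone's feature vector before a binary decision tree assigns a class, consider the sketch below. Everything specific in it is an assumption made for the example: the feature names and their order, the treatment of the blank-block area as a normalized value, the hand-written tree, and all thresholds. It is not the trained classifier used in the experiments.

```python
from dataclasses import dataclass
from typing import Union

# Feature order is an assumption for this sketch: a few features in the spirit
# of [7], followed by the two added background and glyph features.
FEATURE_NAMES = [
    "run_length_mean",
    "run_length_variance",
    "black_pixel_fraction",
    "zone_width_ratio",
    "large_blank_block_area",  # added feature 1 (assumed normalized here)
    "text_glyph_count",        # added feature 2
]


def extend_feature_vector(base, large_blank_block_area, text_glyph_count):
    """Append the two added features to an existing feature vector."""
    return list(base) + [float(large_blank_block_area), float(text_glyph_count)]


@dataclass
class Node:
    """Internal node: go left if x[feature] <= threshold, else go right."""
    feature: int
    threshold: float
    left: Union["Node", str]
    right: Union["Node", str]


def classify(tree, x):
    """Walk the binary tree until a leaf (a class label string) is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node


# Hand-written toy tree with made-up thresholds, NOT a trained classifier.
toy_tree = Node(
    feature=FEATURE_NAMES.index("text_glyph_count"),
    threshold=0.0,
    left="halftone",
    right=Node(
        feature=FEATURE_NAMES.index("large_blank_block_area"),
        threshold=0.25,
        left="text",
        right="table",
    ),
)

base = [4.0, 2.5, 0.18, 0.9]  # hypothetical values for the first four features
zone = extend_feature_vector(base, large_blank_block_area=0.4, text_glyph_count=120)
label = classify(toy_tree, zone)
```

In this toy tree a zone with text glyphs and a large share of blank-block area falls into the table leaf, which mirrors the motivation for the added features. A real tree would be learned from labelled training zones.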
[1] Robert M. Haralick, et al. Document zone classification using sizes of connected components. Electronic Imaging, 1996.
[2] Daniel X. Le, et al. Automated Labeling of Zones from Scanned Documents. 1999.
[3] Emanuele Trucco, et al. Computer and Robot Vision. 1995.
[4] Linda G. Shapiro, et al. Computer and Robot Vision. 1991.
[5] Robert M. Haralick, et al. Zone classification in a document using the method of feature vector generation. Proceedings of 3rd International Conference on Document Analysis and Recognition, 1995.
[6] Henry S. Baird. Background Structure in Document Images. Int. J. Pattern Recognit. Artif. Intell., 1994.
[7] Apostolos Antonacopoulos, et al. Page Segmentation Using the Description of the Background. Comput. Vis. Image Underst., 1998.