Zone classification in a document using the method of feature vector generation

A document can be divided into zones on the basis of its content; for example, a zone can be either text or non-text. This paper describes an algorithm that classifies each given document zone into one of nine classes. The features extracted for each zone include the run-length mean and variance, the spatial mean and variance, the fraction of the total number of black pixels in the zone, and the zone width ratio. Run-length features are computed along four canonical directions. A decision tree classifier assigns a class to each zone on the basis of its feature vector. Performance on an independent test set was 97%.
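To make the run-length features concrete, the following is a minimal sketch, not the authors' implementation. It assumes a zone is a binary image stored as a list of rows with 1 for black pixels, that the four canonical directions are horizontal, vertical, and the two diagonals, and that the black-pixel fraction is the black pixel count divided by the zone area. The names `run_lengths`, `scan_lines`, and `zone_features` are hypothetical. The spatial features, the zone width ratio, and the decision tree itself are left out because the abstract does not define them.

```python
from statistics import mean, pvariance

# Assumed set of four canonical directions (not specified in the abstract).
DIRECTIONS = ("horizontal", "vertical", "diagonal", "antidiagonal")


def run_lengths(line):
    """Return the lengths of maximal runs of black (1) pixels in a 1-D sequence."""
    runs, current = [], 0
    for pixel in line:
        if pixel:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:  # close a run that reaches the end of the line
        runs.append(current)
    return runs


def scan_lines(img, direction):
    """Return the zone's pixels as a list of 1-D lines along one direction."""
    rows, cols = len(img), len(img[0])
    if direction == "horizontal":
        return [list(row) for row in img]
    if direction == "vertical":
        return [[img[r][c] for r in range(rows)] for c in range(cols)]
    if direction == "diagonal":  # lines where r - c is constant
        return [[img[r][r - d] for r in range(rows) if 0 <= r - d < cols]
                for d in range(-(cols - 1), rows)]
    if direction == "antidiagonal":  # lines where r + c is constant
        return [[img[r][s - r] for r in range(rows) if 0 <= s - r < cols]
                for s in range(rows + cols - 1)]
    raise ValueError(f"unknown direction: {direction}")


def zone_features(img):
    """Build a partial feature vector (as a dict) for one binary zone."""
    area = len(img) * len(img[0])
    black = sum(sum(row) for row in img)
    features = {"black_fraction": black / area}
    for d in DIRECTIONS:
        runs = [n for line in scan_lines(img, d) for n in run_lengths(line)]
        # Zones with no black pixels get zero-valued run statistics.
        features[f"{d}_run_mean"] = mean(runs) if runs else 0.0
        features[f"{d}_run_var"] = pvariance(runs) if runs else 0.0
    return features


zone = [
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
]
features = zone_features(zone)
```

In a full pipeline, a vector like `features` would be computed for every zone and passed to a trained decision tree; the population variance used here is one possible choice, since the abstract does not say which variance estimator is used.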
