The Convergence of Iterated Classification

We report an improved methodology for training a sequence of classifiers for document image content extraction, that is, the location and segmentation of regions containing handwriting, machine-printed text, photographs, blank space, etc. The resulting segmentation is pixel-accurate, and so accommodates a wide range of zone shapes (not merely rectangles). We have systematically explored the best scale (spatial extent) of features. We have found that the methodology is sensitive to ground-truthing policy, and especially to the precision of ground-truth boundaries. Experiments on a diverse test set of 83 document images show that tighter ground truth reduces the per-pixel classification error rate by 45% (from 38.9% to 21.4%). Strong evidence, from both experiments and simulation, suggests that under iterated classification region boundaries converge to the ground truth (i.e., they do not drift). Experiments show that four-stage iterated classifiers reduce error rates by 24%. We also present an analysis of special cases suggesting reasons why boundaries converge to the ground truth.
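
As a quick check of the headline figure, the 45% is a relative reduction in per-pixel error rather than a drop in percentage points. Using only the two rates quoted above (the symbols $e_{\text{loose}}$ and $e_{\text{tight}}$ are labels introduced here for illustration, not notation from the paper):

```latex
\[
\frac{e_{\text{loose}} - e_{\text{tight}}}{e_{\text{loose}}}
  = \frac{38.9 - 21.4}{38.9}
  = \frac{17.5}{38.9}
  \approx 0.45
\]
```

The absolute improvement is therefore 17.5 percentage points. The 24% figure for four-stage iterated classifiers is presumably also a relative reduction, but the abstract does not give the underlying rates, so it cannot be checked the same way.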
