Truthing for Pixel-Accurate Segmentation

We discuss problems in developing policies for ground truthing document images for pixel-accurate segmentation. First, we describe ground truthing policies that apply at four different scales: (1) paragraph, (2) text line, (3) character, and (4) pixel. We then analyze difficult or ambiguous cases that will challenge any policy, e.g., blank space and overlapping content. Experiments have shown the benefit of using "tighter" zones that capture more detail (e.g., at the text-line level instead of the paragraph level): tighter ground truth significantly improves classification results, by 45% in recent experiments. It is important to recognize that a pixel-accurate segmentation can be better than manually obtained ground truth. In practice, perfectly accurate pixel-level ground truth may not be achievable, but we believe it is important to explore methods for semi-automatically improving existing ground truth.
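To make the difference between loose and tight zones concrete, the following minimal sketch rasterizes rectangular ground-truth zones onto a toy page and counts how many pixels each policy labels as text. It is only an illustration under stated assumptions: the zone format `(top, left, bottom, right)`, the function names, and the toy page dimensions are all hypothetical and are not taken from the tools or data used in the experiments.

```python
# Hypothetical illustration, not the authors' tooling.
# A zone is (top, left, bottom, right), with bottom and right exclusive.

def rasterize_zones(height, width, zones):
    """Return a height x width mask; True where any zone covers the pixel."""
    mask = [[False] * width for _ in range(height)]
    for top, left, bottom, right in zones:
        for y in range(top, bottom):
            for x in range(left, right):
                mask[y][x] = True
    return mask


def count_labeled(mask):
    """Number of pixels that the zones label as text."""
    return sum(row.count(True) for row in mask)


# Toy 10 x 20 page: two text lines (rows 1-3 and rows 6-8)
# separated by two blank rows (rows 4 and 5).
HEIGHT, WIDTH = 10, 20
paragraph_zones = [(1, 2, 9, 18)]                  # one loose zone around both lines
text_line_zones = [(1, 2, 4, 18), (6, 2, 9, 18)]   # one tight zone per text line

loose = rasterize_zones(HEIGHT, WIDTH, paragraph_zones)
tight = rasterize_zones(HEIGHT, WIDTH, text_line_zones)

# Pixels labeled as text by the paragraph-level zone but not by the
# text-line zones: here, the blank inter-line rows inside the paragraph.
extra = count_labeled(loose) - count_labeled(tight)
print(count_labeled(loose), count_labeled(tight), extra)
```

In this toy case the paragraph-level zone labels the blank rows between the two lines as text, while the text-line zones leave them unlabeled. This is the kind of blank-space ambiguity that a truthing policy has to resolve one way or the other.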
