Why table ground-truthing is hard

It is widely held that for every document analysis task there exists a mechanism for creating well-defined ground-truth. Past experience with standard datasets that provide ground-truth for character recognition and page segmentation supports this belief. In attempting to evaluate several table recognition algorithms we have been developing, however, we have uncovered a number of serious hurdles in the ground-truthing of tables. The problem may, in fact, be much more difficult than it appears. We present a detailed analysis of why table ground-truthing is so hard, including the notions that more than one acceptable "truth" may exist and that a "truth" may be incomplete or partial.
