The UvA color document dataset

Abstract. Publications on color document image analysis present results on small, nonpublicly available datasets. In this paper we propose a well-defined and groundtruthed color dataset consisting of over 1000 pages, with associated tools for evaluation. As we focus on aspects specific to color documents, we leave out the document textual content in the ground truth. The color data groundtruthing and evaluation tools are based on a well-defined document model, complexity measures to assess the inherent difficulty of analyzing a page, and well-founded evaluation measures. Together they form a suitable basis for evaluating diverse applications in color document analysis. Both the dataset and the tools are available through our Web site at http://www.science.uva.nl/UvA-CDD.
