The IUPR Dataset of Camera-Captured Document Images

Major challenges in camera-based document analysis include uneven shading, a high degree of page curl, and perspective distortion. At CBDAR 2007, we introduced the first dataset (DFKI-I) of camera-captured document images in conjunction with a page dewarping contest. A main limitation of that dataset is that it contains images only from technical books with simple layouts and moderate curl and skew. Moreover, it provides no information about the camera's specifications and settings, the imaging environment, or the document contents, although such information is valuable for interpreting experimental evaluations of camera-based document image processing (binarization, page segmentation, dewarping, etc.). In this paper, we introduce a new dataset of camera-captured document images, the IUPR dataset. Compared to the previous dataset, the new dataset contains images from a variety of technical and non-technical books and poses more challenging problems: diverse layouts, a large variety of curl, a wide range of perspective distortions, and resolutions ranging from high to low. In addition, each document image is accompanied by detailed information about the thickness of the book, the imaging environment, and the camera's viewing angle and internal settings. The new dataset will help the research community develop robust camera-captured document processing algorithms that address these challenges and compare different methods on a common ground.
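To illustrate how the per-image capture information described above might be consumed in practice, the following is a minimal sketch in Python. The abstract does not specify the dataset's actual file layout or field names, so the "metadata.json" file, the attribute names, and the loader are all hypothetical assumptions, not the IUPR distribution format.

```python
# Hypothetical sketch: typed records for camera-captured document images
# with the kind of capture metadata the IUPR dataset describes
# (book thickness, imaging environment, camera viewing angle and settings).
# File name "metadata.json" and all field names are assumptions.
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class DocumentImageRecord:
    """One camera-captured page together with its capture conditions."""
    image_path: Path          # path to the captured page image
    book_thickness_mm: float  # thickness of the source book
    environment: str          # imaging environment, e.g. "indoor, daylight"
    viewing_angle_deg: float  # camera viewing angle relative to the page
    camera_settings: dict     # internal camera settings (ISO, exposure, ...)


def load_records(dataset_root: Path) -> List[DocumentImageRecord]:
    """Read a (hypothetical) metadata.json file and build typed records."""
    with open(dataset_root / "metadata.json", encoding="utf-8") as f:
        entries = json.load(f)
    return [
        DocumentImageRecord(
            image_path=dataset_root / e["image"],
            book_thickness_mm=float(e["book_thickness_mm"]),
            environment=e["environment"],
            viewing_angle_deg=float(e["viewing_angle_deg"]),
            camera_settings=e.get("camera_settings", {}),
        )
        for e in entries
    ]
```

Keeping the capture conditions alongside each image in this way makes it straightforward to slice evaluation results by viewing angle, environment, or book thickness when comparing binarization, segmentation, or dewarping methods.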
