Massive, Free and Reproducible Groundtruthed Document Image Databases Generation with DocCreator

Whether your research focuses on image restoration, layout analysis, text-graphic separation, binarization, OCR, or a related task, you need a groundtruthed database to train or evaluate your method. This article presents DocCreator, a multi-platform, open-source software able to create many synthetic document images with controlled groundtruth. With DocCreator, you can create fully synthetic images by choosing the text, font, background, and layout; add various realistic degradations (bleed-through, light defect, paper deformation, ink degradation, etc.) to original images; or combine both to increase the size of your database. DocCreator comes in an online version (easy to test) and a desktop version (fast computation, and no need to upload copyrighted data). DocCreator is useful for retraining tasks and for measuring precisely how robust your algorithm is. It has already been used with favorable results and could help other DIAR researchers to produce and share groundtruthed databases.
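
To make the idea of a degradation with controlled groundtruth concrete, here is a minimal Python sketch of a bleed-through effect. It is an illustration only, not DocCreator's algorithm or API: the names `bleed_through` and `ink_mask`, the `strength` parameter, and the simple "keep the darker pixel" blending rule are all assumptions made for this example. The point it shows is that the groundtruth (here, the ink mask of the clean recto) is known before the degradation is applied, so it stays exact however strong the defect is.

```python
from typing import List

# Grayscale image as rows of pixels: 0 = black ink, 255 = white paper.
Image = List[List[int]]


def ink_mask(img: Image, threshold: int = 128) -> List[List[int]]:
    """Return 1 where a pixel is darker than the threshold, else 0."""
    return [[1 if p < threshold else 0 for p in row] for row in img]


def bleed_through(recto: Image, verso: Image, strength: float = 0.4) -> Image:
    """Hypothetical bleed-through: darken the recto with a faded, mirrored verso.

    strength = 0 leaves the recto untouched; strength = 1 shows the verso ink
    at full darkness. This blending rule is an assumption of the sketch.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    out: Image = []
    for r_row, v_row in zip(recto, verso):
        mirrored = v_row[::-1]  # the back of the page appears flipped left-right
        new_row = []
        for r, v in zip(r_row, mirrored):
            # Attenuate the verso ink, then keep whichever pixel is darker.
            seep = 255 - round(strength * (255 - v))
            new_row.append(min(r, seep))
        out.append(new_row)
    return out


# Tiny example: one ink stroke on the recto, one on the verso.
recto = [[255, 0, 255, 255],
         [255, 0, 255, 255]]
verso = [[0, 255, 255, 255],
         [0, 255, 255, 255]]

groundtruth = ink_mask(recto)  # computed from the clean image
degraded = bleed_through(recto, verso, strength=0.4)
```

Pairing `degraded` with `groundtruth` gives one training or evaluation sample; varying `strength` over the same clean page yields several samples that share a single groundtruth.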
