DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images

Most digital libraries that provide user-friendly interfaces, enabling quick and intuitive access to their resources, rely on Document Image Analysis and Recognition (DIAR) methods. These methods need ground-truthed document images for evaluation and comparison and, in some cases, for training. With the advent of deep learning-based approaches in particular, the required size of annotated document datasets seems to keep growing. Manually annotating real documents has many drawbacks, so reliably annotated datasets are often small. To circumvent those drawbacks and enable the generation of massive ground-truthed data with high variability, we present DocCreator, a multi-platform, open-source software tool that creates large numbers of synthetic document images with controlled ground truth. DocCreator has been used in various experiments, which show the benefit of using such synthetic images to enrich the training stage of DIAR tools.
