Automatic Word Ground Truth Generation for Camera Captured Documents (Pattern Recognition and Media Understanding)

A database of camera-captured documents is useful for training OCR systems to achieve better performance. However, no such dataset exists, because building one manually is laborious and costly. This paper proposes a fully automatic approach for building very large-scale (i.e., millions of images) labeled datasets of camera-captured documents. The proposed approach requires no human intervention in labeling. Evaluation of samples generated by the proposed approach shows that more than 97% of the images are correctly labeled. The novelty of the approach lies in its use of document image retrieval for automatic labeling, in particular for camera-captured documents, which contain camera-specific distortions such as blur and perspective distortion.
