DATA SETS FOR OCR AND DOCUMENT IMAGE UNDERSTANDING RESEARCH

Several significant sets of labeled samples of image data are surveyed that can be used to develop algorithms for offline and online handwriting recognition, as well as for machine-printed text recognition. For each data set, the collection method, the number of samples, and the associated truth data are discussed. In the domain of offline handwriting, the CEDAR, NIST, and CENPARMI data sets are presented; these contain primarily isolated digits and alphabetic characters. The UNIPEN data set of online handwriting was collected from a number of independent sources and contains individual characters as well as handwritten phrases. The University of Washington document image databases are also discussed; they contain a large number of English and Japanese document images selected from a range of publications.
