Ground truth creation for handwriting recognition in historical documents

Handwriting recognition in historical documents is vital for the creation of digital libraries. The creation of readily available ground truth data plays a central role in the development of new recognition technologies. For historical documents, ground truth creation is more difficult and time-consuming than for modern documents. In this paper, we present a semi-automatic ground truth creation procedure for historical documents that takes into account noisy backgrounds and transcription alignment. The proposed procedure is demonstrated on the IAM Historical Handwriting Database (IAM-HistDB), which is currently under construction and will include several hundred Old German manuscripts. We show how laypersons can efficiently create a ground truth for handwriting recognition with a small set of algorithmic tools and few manual interactions.
