Unsupervised Transcription of Historical Documents

We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing-press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system deciphers font structure and transcribes images into text more accurately. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open-source OCR system.
