Cryptogram decoding for optical character recognition

OCR systems for printed documents typically require large numbers of font styles and character models to work well. On an unseen font, performance degrades even in the absence of noise. In this paper, we perform OCR in an unsupervised fashion, without any character models, by applying a cryptogram decoding algorithm. We present results on real and artificial OCR data.
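
To make the cryptogram view concrete, here is a minimal sketch of how text can be decoded once glyph images are treated as unknown cipher symbols. It is not the algorithm from the paper, which the abstract does not specify. It assumes the glyphs have already been grouped into integer cluster IDs, that word boundaries are known, and that a small lexicon is available; the names `pattern`, `decode`, and `render` are hypothetical.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

def pattern(seq: Sequence[Hashable]) -> Tuple[int, ...]:
    """Repetition pattern of a sequence, e.g. 'hello' -> (0, 1, 2, 2, 3)."""
    seen: Dict[Hashable, int] = {}
    return tuple(seen.setdefault(s, len(seen)) for s in seq)

def decode(cipher_words: List[Tuple[int, ...]],
           lexicon: List[str]) -> Optional[Dict[int, str]]:
    """Backtracking search for a one-to-one map from cluster ID to letter.

    Assumption: every cipher word is a lexicon word under one global
    substitution. Returns the first consistent mapping found, or None.
    """
    # Index lexicon words by repetition pattern.
    by_pattern: Dict[Tuple[int, ...], List[str]] = {}
    for word in lexicon:
        by_pattern.setdefault(pattern(word), []).append(word)

    # Visit the most constrained words (fewest candidates) first.
    order = sorted(cipher_words,
                   key=lambda cw: len(by_pattern.get(pattern(cw), [])))

    def search(i: int, mapping: Dict[int, str]) -> Optional[Dict[int, str]]:
        if i == len(order):
            return mapping
        cw = order[i]
        for cand in by_pattern.get(pattern(cw), []):
            new = dict(mapping)
            used = set(new.values())
            ok = True
            for c, p in zip(cw, cand):
                if c in new:
                    if new[c] != p:      # conflicts with an earlier choice
                        ok = False
                        break
                elif p in used:          # letter already taken by another ID
                    ok = False
                    break
                else:
                    new[c] = p
                    used.add(p)
            if ok:
                result = search(i + 1, new)
                if result is not None:
                    return result
        return None

    return search(0, {})

def render(cipher_words: List[Tuple[int, ...]], mapping: Dict[int, str]) -> str:
    """Apply a cluster-to-letter mapping to the cipher text."""
    return " ".join("".join(mapping[c] for c in w) for w in cipher_words)

# Toy input: cluster IDs standing in for the glyphs of a six-word sentence.
lexicon = ["the", "cat", "sat", "on", "mat", "hat"]
cipher = [(0, 1, 2), (3, 4, 0), (5, 4, 0), (6, 7), (0, 1, 2), (8, 4, 0)]
mapping = decode(cipher, lexicon)
decoded = render(cipher, mapping) if mapping is not None else None
```

On short inputs like this one, several mappings can be consistent with the lexicon, so the sketch simply returns the first one it finds. A practical system would also need to tolerate clustering errors and out-of-lexicon words.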
