Unsupervised font reconstruction based on token co-occurrence

High-quality conversion of scanned documents into PDF usually relies on either full OCR or token compression. This paper describes an approach intermediate between the two: it is based on token clustering, but additionally groups tokens into candidate fonts. The approach has the potential to yield OCR-like PDFs when the inputs are of high quality and to degrade to token-based compression when the font analysis fails, while preserving full visual fidelity in either case. The grouping is performed by an unsupervised algorithm that constructs a graph based on token proximity and derives token groups by partitioning this graph. In initial experiments on pages scanned at 300 dpi and containing multiple fonts, this technique reconstructs candidate fonts with 100% accuracy.
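The graph construction and partitioning step can be illustrated with a minimal sketch. The abstract does not say how proximity is measured or how the graph is partitioned, so the choices here are assumptions made only for illustration: proximity is taken to mean adjacency on the same text line, and the partition is obtained by dropping weak edges and taking connected components. The function names, the `min_weight` parameter, and the token-class labels are all hypothetical.

```python
from collections import Counter, defaultdict


def build_cooccurrence_graph(lines):
    """Count how often two token classes appear next to each other.

    `lines` is a list of text lines, each a list of token-class ids in
    reading order (assumed output of an earlier token clustering step).
    Returns a Counter mapping an unordered pair (a, b) to its count.
    """
    edges = Counter()
    for line in lines:
        for a, b in zip(line, line[1:]):
            if a != b:
                edges[(min(a, b), max(a, b))] += 1
    return edges


def partition(edges, nodes, min_weight=2):
    """Drop edges lighter than `min_weight` (an assumed threshold) and
    return the connected components as sorted lists of token classes."""
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), weight in edges.items():
        if weight >= min_weight:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy input: "r_*" classes come from a roman face, "i_*" from an italic one.
# The last line contains a single font change in mid-line.
lines = [
    ["r_t", "r_h", "r_e"],
    ["r_t", "r_h", "r_e"],
    ["i_a", "i_n", "i_d"],
    ["i_a", "i_n", "i_d"],
    ["r_e", "i_a"],
]
nodes = {token for line in lines for token in line}
fonts = partition(build_cooccurrence_graph(lines), nodes)
```

On this toy input the single roman-to-italic transition falls below the threshold, so the two faces end up in separate groups. A real page would likely need a more robust partitioning criterion than a fixed threshold, since font changes inside a line are not always rare.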
