Re-typograph phase I: a proof-of-concept for typeface parameter extraction from historical documents

This paper reports on the first phase of an effort to build a complete retro-engineering pipeline whose goal is to construct a full, coherent set of typographic parameters defining the typefaces used in a homogeneous printed text. It should be stressed that this process cannot reasonably be expected to be fully automatic; it is designed to include human interaction. Although font design is governed by quite robust, formal geometric rulesets, it still relies heavily on subjective human interpretation. Furthermore, different parameters applied to the generic rulesets may produce typefaces that are quite similar and visually difficult to distinguish, which makes retro-engineering an inverse problem that becomes ill-conditioned once shape distortions (related to the printing and/or scanning process) come into play. This work is the first phase of a long iterative process in which we will progressively study and assess the state-of-the-art techniques best suited to our problem, and investigate new directions when they prove inadequate. As a first step, it is primarily a feasibility proof of concept that will allow us to pinpoint the items requiring more in-depth research in subsequent iterations.
