Optical Character Recognition and Parsing of Typeset Mathematics

Abstract There is a wealth of mathematical knowledge that could be very useful in many computational applications but is not available in electronic form. This knowledge resides in mechanically typeset books and journals going back more than 100 years. Besides these older sources, a great many current publications contain useful mathematical information yet are difficult, if not impossible, to obtain in electronic form. Our work aims to encode, for use by computer algebra systems, integral tables and other documents currently available only in hard copy. Our strategy is to extract character information from these documents and pass it to higher-level parsing routines, which extract the mathematical content (or any other useful two-dimensional semantic content). This information can then be output as, for example, a Lisp or TeX expression. We have also developed routines for rapid access to this information, specifically for finding matches with formulas in a table of integrals. This paper reviews our current efforts and summarizes our results and the problems we have encountered.
