A Simple Equation Region Detector for Printed Document Images in Tesseract

Detecting equation regions in scanned books has received attention in the document image research community in the past few years. Compared with regular text blocks, equation regions have more complicated layouts, so they cannot be modeled simply as text lines. On the other hand, these regions consist of text symbols that can be reflowed, so OCR engines should parse them rather than rasterize them as image regions. In this paper, we present an equation detector with two major contributions: (i) it is built on a simple algorithm that uses the density of special symbols, so that no additional classifier is required; and (ii) it has been built into the open-source Tesseract OCR engine, where the OCR community can access and use it. The algorithm is tested on the Google Books database with 1534 entries sampled from books, magazines, and newspapers in over thirty languages. We show that Tesseract's performance improves when the detector is enabled.
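The abstract does not spell out how the density of special symbols is computed, so the following is only a minimal sketch of the general idea, not the detector as implemented in Tesseract. It assumes that a candidate region is available as a recognized text string, that "special symbols" means Unicode math symbols plus a few ASCII operators, and that a fixed threshold separates equations from ordinary text. The names `EXTRA_MATH_CHARS`, `DENSITY_THRESHOLD`, `special_symbol_density`, and `looks_like_equation` are hypothetical and chosen for illustration.

```python
import unicodedata

# Hypothetical symbol set and threshold, picked only for illustration;
# the paper's actual symbol classes and cutoffs are not given in the abstract.
EXTRA_MATH_CHARS = set("-*/^")
DENSITY_THRESHOLD = 0.3


def is_special_symbol(ch):
    """Count Unicode math symbols (category Sm) and a few ASCII operators as special."""
    return ch in EXTRA_MATH_CHARS or unicodedata.category(ch) == "Sm"


def special_symbol_density(text):
    """Return the fraction of non-whitespace characters that are special symbols."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_special_symbol(ch) for ch in chars) / len(chars)


def looks_like_equation(text, threshold=DENSITY_THRESHOLD):
    """Flag a region as a likely equation when its symbol density reaches the threshold."""
    return special_symbol_density(text) >= threshold
```

Under these assumptions, a line such as `x = y + z` has two special symbols among five non-whitespace characters and would be flagged, while a plain sentence would not. No trained classifier is involved, which is the property the abstract emphasizes.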
