Identification of embedded mathematical formulas in PDF documents using SVM

With the tremendous popularity of the PDF format, recognizing mathematical formulas in PDF documents has become a new and important problem in the field of document analysis. In this paper, we present a method for identifying embedded mathematical formulas in PDF documents, based on a Support Vector Machine (SVM). The method first segments text lines into words and then classifies each word into one of two classes: formula or ordinary text. Various features of embedded formulas, including geometric layout, character content, and context, are used to build a robust and adaptable SVM classifier. Embedded formulas are then extracted by merging the words labeled as formulas. Experimental results show that the proposed method performs well. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale e-Book production.
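The final merging step can be illustrated with a minimal sketch. It assumes that each word in a text line already carries a binary label from the classifier (1 for formula, 0 for ordinary text) and that "merging" means joining runs of adjacent formula words within the same line. The abstract does not spell out the merging rule, so this adjacency criterion, the function name `merge_formula_words`, and the label encoding are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Tuple


def merge_formula_words(words: List[str], labels: List[int]) -> List[Tuple[int, int, str]]:
    """Merge runs of consecutive formula-labeled words into formula spans.

    Assumption: label 1 means "formula", anything else means "ordinary text".
    Returns one (start_index, end_index_inclusive, joined_text) per span.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")

    spans: List[Tuple[int, int, str]] = []
    start = None  # index where the current formula run began, if any
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a new formula run begins here
        elif label != 1 and start is not None:
            # the run ended at the previous word
            spans.append((start, i - 1, " ".join(words[start:i])))
            start = None
    if start is not None:
        # the line ended while still inside a formula run
        spans.append((start, len(words) - 1, " ".join(words[start:])))
    return spans


# Hypothetical text line with hand-assigned labels standing in for SVM output.
line_words = ["where", "x", "+", "y", "is", "small", "and", "z^2"]
line_labels = [0, 1, 1, 1, 0, 0, 0, 1]
formula_spans = merge_formula_words(line_words, line_labels)
```

Here the labels are written by hand; in the described method they would come from the SVM classifier applied to each word's layout, character, and context features.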
