A content-constrained spatial (CCS) model for layout analysis of mathematical expressions

This paper proposes a content-constrained spatial (CCS) model to recover the mathematical layout (M-layout, or MLme) of a mathematical expression (ME) from its font setting layout (F-layout, or FLme). The M-layout can be used in content analysis applications such as ME-based indexing and retrieval of documents. The process has two steps. The first step divides a compound ME into blocks based on explicit mathematical structure primitives such as fraction lines, radical signs, and fences. In the second step, subscripts and superscripts within a block are resolved by probabilistic inference of their likelihood under a global optimization model. The features that capture the relative position between sibling blocks as superscript or subscript follow dual-peak distributions, which calls for a sampling-based non-parametric probability distribution estimation method to resolve their ambiguity. The notion of spatial constraint indicators is proposed to reduce the search space while improving prediction performance. The proposed scheme is tested on the InftyCDB data set and achieves an F1 score of 0.98.
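To illustrate why a non-parametric estimate suits a dual-peak feature, the following is a minimal sketch of Gaussian kernel density estimation used to label a sibling block as superscript, subscript, or baseline. It is not the paper's implementation: the feature (a normalized vertical offset), the sample values, the bandwidth, the equal class priors, and the names `gaussian_kde` and `classify` are all assumptions made for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x from the samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per training sample; no single-peak shape is assumed.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical training data: vertical offset of a sibling block relative to
# its base, normalized by base height (positive means raised). The two script
# classes are deliberately given two clusters each to mimic a dual-peak shape.
training_offsets = {
    "superscript": [0.35, 0.40, 0.42, 0.45, 0.50, 0.75, 0.80, 0.82, 0.85],
    "subscript": [-0.30, -0.35, -0.38, -0.40, -0.45, -0.70, -0.75, -0.80],
    "baseline": [-0.05, -0.02, 0.00, 0.01, 0.03, 0.05],
}

BANDWIDTH = 0.05  # assumed smoothing width, not a value from the paper

densities = {
    label: gaussian_kde(samples, BANDWIDTH)
    for label, samples in training_offsets.items()
}


def classify(offset):
    """Pick the label whose estimated density is highest (equal priors assumed)."""
    return max(densities, key=lambda label: densities[label](offset))


if __name__ == "__main__":
    for offset in (0.8, 0.4, 0.0, -0.4, -0.75):
        print(offset, classify(offset))
```

A single Gaussian fitted to the superscript samples above would place its peak between the two clusters, where few samples actually lie; the kernel estimate keeps both peaks. In a full system the per-class likelihoods would feed the global optimization rather than being compared in isolation as done here.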
