Textual and Visual Characteristics of Mathematical Expressions in Scholar Documents

Mathematical expressions (MEs) are widely used in scholarly documents. In this paper we analyze the textual and visual characteristics of MEs for the image-to-LaTeX translation task. While open datasets of LaTeX files containing MEs exist, extracting these MEs from a document and compiling them into a list is very complicated. We therefore release a corpus of open-access scholarly documents with parallel PDF and JATS-XML files. The MEs in these documents are encoded in LaTeX and are document independent. The data contains more than 1.2 million distinct annotated formulae and more than 80 million raw tokens of LaTeX MEs in more than 8 thousand documents. Although the range of textual lengths and visual sizes of MEs is not well defined, we found that the task of analyzing MEs in scholarly documents can be reduced to a subtask with bounded text length and bounded image width and height, and that display MEs can be processed as arrays of partial MEs.
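
As an illustration of how LaTeX-encoded MEs might be collected from a JATS-XML file and measured by token length, the following is a minimal sketch and not the authors' pipeline. It assumes the formulae are stored in the standard JATS `tex-math` element inside `inline-formula` and `disp-formula`. The tokenizer, the function names, the sample document and the length bound of 8 tokens are all hypothetical, and the actual bounds used in the paper are not stated here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical tokenizer: a LaTeX command (\name), an escaped symbol (\{),
# or any other single non-space character counts as one token.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")


def tokenize(latex):
    """Split a LaTeX string into raw tokens."""
    return TOKEN_RE.findall(latex)


def extract_mes(jats_xml):
    """Return (kind, latex) pairs for every tex-math element in the document."""
    root = ET.fromstring(jats_xml)
    mes = []
    for kind, tag in (("inline", "inline-formula"), ("display", "disp-formula")):
        for formula in root.iter(tag):
            for tex in formula.iter("tex-math"):
                latex = (tex.text or "").strip()
                if latex:
                    mes.append((kind, latex))
    return mes


def length_stats(mes, max_tokens):
    """Count the MEs and how many fit within an assumed token-length bound."""
    lengths = [len(tokenize(latex)) for _, latex in mes]
    kept = sum(1 for n in lengths if n <= max_tokens)
    return {"count": len(lengths), "max": max(lengths, default=0), "kept": kept}


# Tiny made-up document used only to exercise the functions above.
SAMPLE = r"""<article><body>
<p>Let <inline-formula><tex-math>x^2 + y^2</tex-math></inline-formula> hold.</p>
<disp-formula><tex-math>\frac{a}{b} = \alpha</tex-math></disp-formula>
</body></article>"""

if __name__ == "__main__":
    found = extract_mes(SAMPLE)
    print(found)
    print(length_stats(found, max_tokens=8))
```

Restricting the analysis to MEs under a chosen bound, as `length_stats` does with `max_tokens`, mirrors the reduction to a bounded-length subtask described above. Image width and height bounds would need a rendering step that this sketch does not cover.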
