Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context

Mathematical formulae represent complex semantic information in a concise form. Especially in science, technology, engineering, and mathematics (STEM), formulae are crucial for communicating information, e.g., in scientific papers, and for performing computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and the content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) providing an openly available benchmark dataset for the mathematical format conversion task, consisting of a newly created test collection, an extensive, manually curated gold standard, and task-specific evaluation metrics; (2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions; (3) presenting a new approach that considers the textual context of formulae to reduce the error rate for mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators, and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism.
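As an illustration of the distinction between presentation and content, the following minimal Python sketch encodes the layout "f(x)" once in Presentation MathML and twice in Content MathML, once per plausible meaning. The markup snippets, the `tags` helper, and the keyword-based `guess_content` heuristic are assumptions made for this example only; they are not the approach or the metrics evaluated in this work.

```python
import xml.etree.ElementTree as ET

# Presentation MathML: records only the layout of "f(x)".
PRESENTATION = "<math><mi>f</mi><mo>(</mo><mi>x</mi><mo>)</mo></math>"

# Content MathML: two different meanings the same layout could have.
CONTENT_APPLICATION = "<math><apply><ci>f</ci><ci>x</ci></apply></math>"
CONTENT_PRODUCT = "<math><apply><times/><ci>f</ci><ci>x</ci></apply></math>"


def tags(markup: str) -> list[str]:
    """Return the element names of a markup string in document order."""
    return [element.tag for element in ET.fromstring(markup).iter()]


def guess_content(context: str) -> str:
    """Hypothetical toy heuristic: pick a reading from the surrounding text.

    If the text mentions a function, read f(x) as function application;
    otherwise read it as the product of f and x.
    """
    if "function" in context.lower():
        return CONTENT_APPLICATION
    return CONTENT_PRODUCT


if __name__ == "__main__":
    print(tags(PRESENTATION))
    print(tags(guess_content("where f is a function of x")))
    print(tags(guess_content("where f is a scaling factor")))
```

The presentation markup is identical for both readings, so a converter that sees only the formula cannot tell them apart; in this toy setting, the surrounding sentence is what resolves the ambiguity.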
