Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language

Converting Science, Technology, Engineering, and Mathematics (STEM) documents into formal expressions has a large impact on academia and industry. It enables us to construct databases of mathematical knowledge, search for formulae, and develop systems that generate executable code automatically. However, such conversion is an exceedingly ambitious goal. Mathematical expressions are commonly used in scientific communication in numerous fields such as mathematics and physics, and in many cases they express the key ideas of STEM documents. Important as they are, formulae and text are complementary to each other, and the formulae in a document cannot be understood independently of the surrounding text. Thus, deep synthetic analyses of natural language and mathematical expressions are necessary. To date, considerable effort has gone into developing Natural Language Processing (NLP) techniques, including semantic parsing [4], but their targets are mostly ‘general’ texts. As a result, conventional NLP techniques offer only limited support for formulae and for the numerous linguistic phenomena specific to STEM documents [3]. Meanwhile, the semantics of mathematical expressions has also been investigated in depth; results can be seen in logic theories, the MathML specification [1], etc. However, there is a large gap between formal expressions such as first-order logic and the actual formulae that appear in natural language texts.