Paper Information - Embedding Formulae and Text for Improved Math Retrieval

Embedding Formulae and Text for Improved Math Retrieval

Large data collections containing millions of math formulae are available online, but retrieving math expressions from them is challenging because the structural complexity of formulae requires specialized processing. When searching for mathematical content, accurate measures of formula similarity can help with tasks such as document ranking, query recommendation, and result set clustering. While there have been many attempts at embedding words and graphs, formula embedding is still in its early stages. This research aims to introduce an embedding model for mathematical formulae and accompanying text that can be used in math information retrieval. To that end, embedding models for isolated formulae are introduced first, and intrinsic measures are used to study the effectiveness and efficiency of retrieval with those embeddings. Those results support the second goal of this research, which is to develop joint embedding models for formulae and text that can support the full range of content encountered in math retrieval. This can be seen as a special case of multimodal embedding, and may therefore benefit from related research that jointly models other cases in which text and structured representations are co-present, such as chemistry. I summarize the research questions as follows:

RQ1: How can we effectively provide an embedding model for isolated mathematical formulae?
RQ2: How should the joint embedding of text and formulae be done?
RQ3: How can evaluation of math search be grounded in a representative task?

For RQ1, I propose to first study simple models that walk the tree structure, in order to assess the effectiveness and efficiency of formula embedding, and then move to more advanced models. I have introduced the Tangent-CFT [2] model. As my next step for formula embedding, I plan to look at deep neural network models that have been applied to graph embedding.
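To make the idea of a "simple model that walks the tree structure" concrete, the following is a minimal sketch and not the Tangent-CFT implementation. It assumes a hypothetical operator-tree encoding as nested `(label, children)` pairs, extracts `(ancestor, descendant, path)` tuples by a depth-first walk, and compares two formulae by cosine similarity over tuple counts. The bag-of-tuples vector is a stand-in for a learned embedding; the names `leaf`, `path_tuples`, and `cosine` are illustrative only.

```python
from collections import Counter
from math import sqrt


def leaf(label):
    """Build a childless node in the assumed (label, children) encoding."""
    return (label, [])


def path_tuples(tree, max_depth=2):
    """Walk the tree and emit (ancestor, descendant, path) tuples.

    `path` is the tuple of child indices leading from the ancestor to the
    descendant, limited to `max_depth` edges.
    """
    tuples = []

    def descend(ancestor_label, node, path):
        label, children = node
        if path:  # skip the ancestor paired with itself
            tuples.append((ancestor_label, label, tuple(path)))
        if len(path) < max_depth:
            for i, child in enumerate(children):
                descend(ancestor_label, child, path + [i])

    def visit(node):
        # Treat every node in turn as the ancestor of the pairs below it.
        descend(node[0], node, [])
        for child in node[1]:
            visit(child)

    visit(tree)
    return tuples


def cosine(tuples_a, tuples_b):
    """Cosine similarity between two bags of tuples (0.0 if either is empty)."""
    ca, cb = Counter(tuples_a), Counter(tuples_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


# x^2 + y versus x^2 + z, written as operator trees.
f1 = ("+", [("^", [leaf("x"), leaf("2")]), leaf("y")])
f2 = ("+", [("^", [leaf("x"), leaf("2")]), leaf("z")])
similarity = cosine(path_tuples(f1), path_tuples(f2))
```

In this toy setting the two formulae differ in a single tuple, so the similarity is high but below 1. A learned model would replace the raw counts with dense vectors trained on the tuples.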
After studying an embedding model for isolated formulae, in RQ2 I plan to focus on making use of the text surrounding formulae. I will consider four possible approaches to constructing a joint embedding model: (1) linearizing the tree structure of formulae into sequences and then applying a single sequence embedding model to the text and the linearized formula, similar to [1]; (2) forming separate embeddings for text and formulae, then unifying the two embedding spaces using seed alignments obtained either through supervision or using heuristics; (3) extracting a tree from the text and then applying a structure embedding model to both trees; or (4) combining results from specialized embedding models. For example, if the task is retrieval (ranking), then in the simplest scenario the results can be combined with methods such as Reciprocal Rank Fusion (RRF) or CombMNZ. I would then study how text and formula embedding models should be combined. One possible solution is to do retrieval using each of the embeddings and then combine the results. Another is to learn a model that provides a unified embedding capturing both formula and text features. A third is to convert text to a tree structure, which lets me treat the problem as tree-to-tree translation.

For both RQ1 and RQ2, I plan to first study the effectiveness of the proposed embedding in formula retrieval before proceeding to the text+formula condition. Results will be compared with the best reported results on the ARQMath [3] question answering task. While part of this research focuses on creating an embedding model for math, I also need a standard evaluation protocol and dataset. In a planned three-year sequence of ARQMath labs, I aim to answer RQ3 and provide high-quality training, devtest, and test sets for math search.
Importantly, ARQMath also serves as a platform for operationalizing a repeatable community-consensus definition for relevance in isolated formula search.
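The result-combination option mentioned above (RRF or CombMNZ) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in this research. RRF uses the commonly cited constant `k = 60`. CombMNZ is shown with per-run min-max score normalization, which is one common choice among several. The run contents and document ids are invented for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by doc id.
    return sorted(scores, key=lambda d: (-scores[d], d))


def comb_mnz(score_lists):
    """Fuse {doc_id: score} runs: sum of normalized scores times hit count."""
    totals = defaultdict(float)
    hits = defaultdict(int)
    for scores in score_lists:
        if not scores:
            continue
        lo, hi = min(scores.values()), max(scores.values())
        # Assumption: min-max normalization per run. If all scores in a run
        # are equal, every document in that run contributes 0.
        span = (hi - lo) or 1.0
        for doc_id, score in scores.items():
            totals[doc_id] += (score - lo) / span
            hits[doc_id] += 1
    fused = {d: totals[d] * hits[d] for d in totals}
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical runs from a formula-embedding model and a text-embedding model.
formula_run = {"a1": 0.9, "a2": 0.6, "a3": 0.3}
text_run = {"a2": 0.8, "a4": 0.5, "a1": 0.2}

rrf_order = reciprocal_rank_fusion([
    sorted(formula_run, key=formula_run.get, reverse=True),
    sorted(text_run, key=text_run.get, reverse=True),
])
mnz_order = comb_mnz([formula_run, text_run])
```

RRF needs only ranks, so it applies even when the two models produce scores on incomparable scales. CombMNZ uses the scores themselves and additionally rewards documents retrieved by both models.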

Behrooz Mansouri

[1] Douglas W. Oard et al. Overview of ARQMath 2020: CLEF Lab on Answer Retrieval for Questions on Math, 2020, CLEF.

[2] Douglas W. Oard et al. Tangent-CFT: An Embedding Model for Mathematical Formulas, 2019, ICTIR.