Tangent-CFT: An Embedding Model for Mathematical Formulas

When searching for mathematical content, accurate measures of formula similarity can help with tasks such as document ranking, query recommendation, and result set clustering. While there have been many attempts at embedding words and graphs, formula embedding is in its early stages. We introduce a new formula embedding model that we use with two hierarchical representations, (1) Symbol Layout Trees (SLTs) for appearance, and (2) Operator Trees (OPTs) for mathematical content. Following the approach of graph embeddings such as DeepWalk, we generate tuples representing paths between pairs of symbols depth-first, embed tuples using the fastText n-gram embedding model, and then represent an SLT or OPT by its average tuple embedding vector. We then combine SLT and OPT embeddings, leading to state-of-the-art results for the NTCIR-12 formula retrieval task. Our fine-grained holistic vector representations allow us to retrieve many more partially similar formulas than methods using structural matching in trees. Combining our embedding model with structural matching in the Approach0 formula search engine produces state-of-the-art results for both fully and partially relevant results on the NTCIR-12 benchmark. Source code for our system is publicly available.
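The tuple-and-average pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation. The `Node` class, the edge labels (`a` for above, `n` for next), and the simplified tuple shape `(ancestor, descendant, path)` are assumptions made for illustration. The `toy_embed` function is a hash-based stand-in for a trained fastText model, used only so that the example runs without external dependencies.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a symbol plus labeled edges to child nodes."""
    symbol: str
    children: list = field(default_factory=list)  # list of (edge_label, Node)


def path_tuples(root):
    """Walk the tree depth-first and emit one (ancestor, descendant, path)
    tuple for every ancestor-descendant pair of symbols."""
    tuples = []

    def visit(node, ancestors):
        # ancestors holds (symbol, path from that ancestor to this node)
        for anc_symbol, path in ancestors:
            tuples.append((anc_symbol, node.symbol, path))
        for label, child in node.children:
            extended = [(s, p + label) for s, p in ancestors]
            extended.append((node.symbol, label))
            visit(child, extended)

    visit(root, [])
    return tuples


def toy_embed(tup, dim=8):
    """Stand-in for a trained fastText model (assumption): maps a tuple to a
    deterministic pseudo-random vector with components in [-1, 1]."""
    digest = hashlib.sha256("|".join(tup).encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(dim)]


def formula_vector(root, dim=8):
    """Represent a tree by the average of its tuple embedding vectors."""
    vectors = [toy_embed(t, dim) for t in path_tuples(root)]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Simplified layout tree for x^2 + y: "2" sits above "x", "+" follows "x".
slt = Node("x", [("a", Node("2")), ("n", Node("+", [("n", Node("y"))]))])

for t in path_tuples(slt):
    print(t)
print(formula_vector(slt))
```

In a real system, `toy_embed` would be replaced by vectors from a model trained on the tuples of a formula collection, and the same averaging would be applied separately to the SLT and the OPT of each formula before the two representations are combined.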
