PyA0: A Python Toolkit for Accessible Math-Aware Search

Mathematical Information Retrieval (MIR) has been actively studied in recent years, and many fruitful results have emerged. Among these, the Approach Zero system is one of the few math-aware search engines able to perform substructure matching efficiently. It was also deployed as a strong baseline in ARQMath 2020, the most recent community-wide MIR evaluation, owing to its empirical effectiveness and its ability to handle structured math content. However, to implement a retrieval model that handles structured queries efficiently, Approach Zero is written in C from the ground up and requires special pipelines for processing math content and queries. As a result, the system is not conveniently accessible or reusable as a research tool for the community. In this paper, we present PyA0, an easy-to-use Python toolkit built on Approach Zero that makes the system more accessible to researchers. We introduce the toolkit interface and report evaluation results on popular MIR datasets to demonstrate the effectiveness and efficiency of our toolkit. The PyA0 source code is publicly available at https://github.com/approach0/pya0, which includes a link to a notebook demo.
