Approach Zero and Anserini at the CLEF-2021 ARQMath Track: Applying Substructure Search and BM25 on Operator Tree Path Tokens

This paper reports on Approach Zero, a substructure-aware math search system, as applied in our submission to the ARQMath lab at CLEF 2021. We participated in both Task 1 (math ARQ) and Task 2 (formula retrieval) this year. In addition to substructure retrieval, we added a traditional full-text search pass based on the Anserini toolkit [1]. We use the same path features extracted from the Operator Tree (OPT) to index and retrieve math formulas in Anserini, and we interpolate the Anserini results with the structural results from Approach Zero. We also explored automatic and table-based keyword expansion methods for math formulas. Additionally, we report preliminary results from using previous years’ labels and applying learning to rank to our first-stage search results. In this lab, we obtained the most effective search results in Task 2 (formula retrieval) among submissions from 7 participants, including the baseline system. Our experiments also show a substantial improvement over the baseline result we produced the previous year.
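The interpolation step mentioned above can be illustrated with a minimal sketch. The abstract does not state the exact fusion formula, so this example assumes a simple linear interpolation of min-max normalized scores; the function names, the `alpha` weight, and the document IDs are hypothetical and are not taken from Approach Zero or Anserini.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two systems become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal: treat every document as top-scored.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def interpolate(
    structural: Dict[str, float],
    bm25: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Linearly combine two score maps (assumed scheme, not the paper's).

    alpha weights the structural scores and (1 - alpha) the BM25 scores.
    A document missing from one list contributes 0 for that list.
    """
    a = min_max_normalize(structural)
    b = min_max_normalize(bm25)
    fused = {
        doc: alpha * a.get(doc, 0.0) + (1.0 - alpha) * b.get(doc, 0.0)
        for doc in set(a) | set(b)
    }
    # Sort by descending fused score, breaking ties by document ID.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores from a structural pass and a BM25 pass.
structural_scores = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
bm25_scores = {"d2": 8.0, "d3": 4.0, "d4": 0.0}
ranked = interpolate(structural_scores, bm25_scores, alpha=0.5)
```

In this sketch, a document that scores moderately in both passes (`d2`) can outrank one that scores highly in only one pass (`d1`), which is the usual motivation for combining a structural ranker with a bag-of-tokens ranker.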

[1] Jimmy J. Lin, et al. PYA0: A Python Toolkit for Accessible Math-Aware Search, 2021, SIGIR.

[2] Zhi Tang, et al. MathBERT: A Pre-Trained Model for Mathematical Formula Understanding, 2021, ArXiv.

[3] Douglas W. Oard, et al. Overview of ARQMath 2020: CLEF Lab on Answer Retrieval for Questions on Math, 2020, CLEF.

[4] Ruder-Club Witten, et al. Ranking, 2020, Bandit Algorithms.

[5] Jimmy J. Lin, et al. Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants, 2020, ECIR.

[6] C. L. Giles, et al. Accelerating Substructure Similarity Search for Formula Retrieval, 2020, ECIR.

[7] Douglas W. Oard, et al. Tangent-CFT: An Embedding Model for Mathematical Formulas, 2019, ICTIR.

[8] Wei Zhong, et al. Structural Similarity Search for Formulas Using Leaf-Root Paths in Operator Subtrees, 2019, ECIR.

[9] Dallas J. Fraser, et al. Choosing Math Features for BM25 Ranking with Tangent-L, 2018, DocEng.

[10] Jimmy J. Lin, et al. Anserini: Enabling the Use of Lucene for Information Retrieval Research, 2017, SIGIR.

[11] Frank Wm. Tompa, et al. Multi-Stage Math Formula Search: Using Appearance-Based Similarity Metrics at Scale, 2016, SIGIR.

[12] ChengXiang Zhai, et al. Lower-bounding term frequency normalization, 2011, CIKM '11.

[13] Pinar Donmez, et al. On the local optimality of LambdaRank, 2009, SIGIR.

[14] Noriko Kando, et al. On information retrieval metrics designed for evaluation with incomplete relevance assessments, 2008, Information Retrieval.

[15] W. Bruce Croft, et al. Relevance-Based Language Models, 2001, SIGIR '01.

[16] Andrew S. Lan, et al. Mathematical Formula Representation via Tree Embeddings, 2021, iTextbooks@AIED.

[17] Douglas W. Oard, et al. Overview of ARQMath-2 (2021): Second CLEF Lab on Answer Retrieval for Questions on Math, 2021, CLEF.

[18] Yin Ki Ng, et al. Dowsing for Math Answers with Tangent-L, 2020, CLEF.

[19] Fernando Diaz, et al. UMass at TREC 2004: Novelty and HARD, 2004, TREC.