Dowsing for Math Answers with Tangent-L

We present our application of the math-aware search engine Tangent-L to the ARQMath Community Question Answering (CQA) task. Our approach performs well, placing in the top three positions among all 23 submissions, baseline runs included. Tangent-L, built on the text search platform Lucene, handles a math formula by first converting its Presentation MathML representation into a Symbol Layout Tree and then extracting math tuples from the tree to serve as search terms. During search, it applies BM25 ranking to all math tuples and natural language terms in a document. For the CQA task, we index all question-answer pairs in the Math Stack Exchange corpus. At query time, we first convert each topic question into a bag of formulae and keywords that serves as a formal query. We then execute the query using Tangent-L to find the best matches. Finally, we re-rank the matches with a regression model trained on metadata attributes from the corpus. Our primary run produces an nDCG′ value of 0.278 and a MAP′ value of 0.063, two common measures of quality for ranked retrieval. However, our best performance, an nDCG′ value of 0.345 and a MAP′ value of 0.139, is achieved by an alternate run without re-ranking. Follow-up experiments help to explain which aspects of our approach contribute to its success.
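To illustrate how a single BM25 ranking can cover both math tuples and natural language terms, the following minimal sketch treats every document as one bag of tokens. It is not Tangent-L's implementation: the function name `bm25_scores`, the parameter defaults, the toy documents, and the `M:(...)` token strings standing in for math tuples are all hypothetical, and the scoring shown is plain Okapi BM25 with a Lucene-style idf.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query with Okapi BM25.

    Keywords and math-tuple tokens are treated identically: both are just terms.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length-normalized term-frequency saturation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus: "M:(...)" strings are placeholder math tuples,
# not the tuple format that Tangent-L actually emits.
docs = [
    ["solve", "quadratic", "equation", "M:(x,2,above)", "M:(2,+,next)"],
    ["integral", "of", "sine", "M:(sin,x,next)"],
    ["quadratic", "formula", "proof"],
]
query = ["quadratic", "M:(x,2,above)"]

scores = bm25_scores(query, docs)
# Document indices ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this toy setup, the document that matches both the keyword and the math tuple ranks above the one that matches the keyword alone, and the document that matches neither receives a score of zero.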
