Efficient Algorithm for Math Formula Semantic Search

Mathematical formulae play an important role in many scientific domains. Despite the importance of mathematical formula search, conventional keyword-based retrieval methods are not sufficient for searching mathematical formulae, which are structured as trees. The increasing number and structural complexity of mathematical formulae in scientific articles call for large-scale, structure-aware formula search techniques. In this paper, we formulate three types of measures that represent distinctive features of semantic similarity of math formulae, and develop efficient hash-based algorithms for their approximate calculation. Our experiments using the NTCIR-11 Math-2 Task dataset, a large-scale test collection for math information retrieval with about 60 million formulae, show that the proposed method improves search precision while keeping scalability and runtime efficiency high.

key words: tree hashing, MathML, mathematical formula search, information retrieval
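To make the idea of hash-based comparison of tree-structured formulae concrete, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes a formula is encoded as a nested (label, children) tuple, hashes every subtree bottom-up, and scores two formulae by the Jaccard overlap of their subtree hash sets. The tuple encoding, the function names, and the choice of Jaccard overlap are all illustrative assumptions; the three similarity measures of the paper are not reproduced here.

```python
import hashlib

# Assumed encoding (hypothetical): a node is (label, [child nodes]).


def subtree_hashes(node, out=None):
    """Return (hash of this subtree, list of hashes of all its subtrees)."""
    if out is None:
        out = []
    label, children = node
    # Hash children first so a parent's hash depends on its whole subtree.
    child_hashes = [subtree_hashes(child, out)[0] for child in children]
    payload = label + "(" + ",".join(child_hashes) + ")"
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:16]
    out.append(digest)
    return digest, out


def jaccard(a, b):
    """Jaccard overlap of two collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def similarity(tree_a, tree_b):
    """Illustrative structural similarity: shared subtrees over all subtrees."""
    return jaccard(subtree_hashes(tree_a)[1], subtree_hashes(tree_b)[1])


# x^2 + 1 versus x^2 + y: the subtrees x, 2 and x^2 are shared.
f1 = ("plus", [("power", [("x", []), ("2", [])]), ("1", [])])
f2 = ("plus", [("power", [("x", []), ("2", [])]), ("y", [])])
score = similarity(f1, f2)
```

Because identical subtrees map to identical hash values, shared substructure can be found with set operations on fixed-size keys rather than by pairwise tree comparison, which is the general reason hashing suits large formula collections.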
