Accelerating Substructure Similarity Search for Formula Retrieval

Formula retrieval systems that use substructure matching are effective but suffer from slow retrieval times caused by the complexity of structure matching. We present a specialized inverted index and a rank-safe dynamic pruning algorithm for faster substructure retrieval. Formulas are indexed from their Operator Tree (OPT) representations. We evaluate our model on the NTCIR-12 Wikipedia Formula Browsing Task and on a new formula corpus produced from Math StackExchange posts. Our approach preserves the effectiveness of structure matching while allowing queries to be executed in real time.
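To make the phrase "rank-safe dynamic pruning" concrete, the sketch below shows a generic upper-bound pruning scheme over a toy inverted index. It is not the algorithm presented in this paper: it assumes a purely additive score with non-negative integer weights, whereas substructure matching is more involved. The token strings (such as "add/x"), the weights, and the function names are all hypothetical and chosen only for illustration. "Rank-safe" here means the pruned search returns exactly the same top-k list as exhaustive scoring.

```python
import heapq

def build_index(docs):
    """Invert {formula_id: {token: weight}} into {token: {formula_id: weight}}."""
    index = {}
    for doc_id, tokens in docs.items():
        for tok, w in tokens.items():
            index.setdefault(tok, {})[doc_id] = w
    return index

def score(index, terms, doc_id):
    # Additive score (an assumption): sum of weights of matching tokens.
    return sum(index[t].get(doc_id, 0) for t in terms)

def top_k_exhaustive(index, query, k):
    """Baseline: score every formula that shares a token with the query."""
    terms = [t for t in set(query) if t in index]
    docs = set().union(*(index[t] for t in terms))
    ranked = sorted(((score(index, terms, d), -d) for d in docs), reverse=True)
    return [(-neg, s) for s, neg in ranked[:k]]

def top_k_pruned(index, query, k):
    """Same top-k as the baseline, but skips formulas that cannot qualify."""
    if k <= 0:
        return [], 0
    # Visit terms from the largest per-term upper bound to the smallest.
    terms = sorted((t for t in set(query) if t in index),
                   key=lambda t: (max(index[t].values()), t), reverse=True)
    # suffix[i] = best score reachable using only terms[i:].
    suffix = [0] * (len(terms) + 1)
    for i in range(len(terms) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + max(index[terms[i]].values())
    heap, seen, scored = [], set(), 0   # min-heap of (score, -formula_id)
    for i, t in enumerate(terms):
        # Unseen formulas contain only terms[i:]; stop if none can reach the top k.
        if len(heap) == k and suffix[i] < heap[0][0]:
            break
        for d, w in index[t].items():
            if d in seen:
                continue
            seen.add(d)
            # Strict '<' keeps ties, so the result matches the baseline exactly.
            if len(heap) == k and w + suffix[i + 1] < heap[0][0]:
                continue
            scored += 1
            entry = (score(index, terms[i:], d), -d)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)
    ranked = sorted(heap, reverse=True)
    return [(-neg, s) for s, neg in ranked], scored

# Toy corpus with integer formula ids and made-up token weights.
docs = {
    1: {"add/x": 3, "add/y": 3, "frac/x": 1},
    2: {"add/x": 3, "add/y": 2},
    3: {"frac/x": 1},
    4: {"frac/x": 1, "sup/2": 1},
    5: {"add/x": 1},
}
index = build_index(docs)
query = ["add/x", "add/y", "frac/x"]
pruned, n_scored = top_k_pruned(index, query, k=2)
```

In this toy run the pruned search fully scores fewer formulas than the five candidates the baseline visits, while returning the same ranked list. The saving comes from the threshold check: once the k-th best score exceeds what any unseen formula could still reach, the remaining posting lists are never opened.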
