WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document

Nowadays, mathematical information is increasingly available on websites and in repositories such as arXiv, Wikipedia, and a growing number of digital libraries. Mathematical formulae are highly structured and are usually presented in layout-based formats such as PDF, LaTeX, and Presentation MathML. The differences in presentation between text and formulae challenge traditional text-based indexing and retrieval methods. To address this challenge, this paper proposes an upgraded Mathematical Information Retrieval (MIR) system, WikiMirs 3.0, based on the context, structure, and importance of formulae in a document. In WikiMirs 3.0, users can easily "cut" formulae and their contexts from PDF documents as well as type in queries. Furthermore, a novel hybrid indexing and matching model is proposed to support both exact and fuzzy matching. The hybrid model takes both the context and the structure of formulae into consideration. In addition, the concept of formula importance within a document is introduced into the model to produce more reasonable rankings. Experimental results on Wikipedia, compared with two classical MIR systems, demonstrate that the proposed system, together with the novel model, provides higher accuracy and better ranking results.
