Retrieving documents with mathematical content

Many documents with mathematical content are published on the Web, but conventional search engines that rely solely on keyword search cannot fully exploit their mathematical information. In particular, keyword search is insufficient when the expressions in a document are not annotated with natural keywords, or when the user cannot describe her query with keywords. Retrieving documents by querying their mathematical content directly is appealing in many domains, including education, digital libraries, engineering, patent documents, and medical sciences. Capturing the relevance of mathematical expressions also greatly enhances document classification in such domains. Unlike text retrieval, where keywords carry enough semantics to distinguish and rank text documents, math symbols carry little semantic information on their own. In fact, mathematical expressions typically consist of a few alphabetic symbols organized in rather complex structures. Hence the structure of an expression, which describes how those symbols are combined, should also be considered. Unfortunately, there is no standard testbed with which to evaluate the effectiveness of a mathematics retrieval algorithm. In this paper we study the fundamental and challenging problems in mathematics retrieval, namely how to capture the relevance of mathematical expressions, how to query them, and how to evaluate the results. We describe various search paradigms and propose retrieval systems accordingly. We discuss the benefits and drawbacks of each approach, and further compare them through an extensive empirical study.
