Layout-based substitution tree indexing and retrieval for mathematical expressions

We introduce a new system for layout-based (LaTeX) indexing and retrieval of mathematical expressions using substitution trees. Substitution trees can efficiently store and retrieve expressions based on the similarity of their symbols, symbol layout, sub-expressions, and size. We describe our novel implementation and some of our modifications to the substitution tree indexing and retrieval algorithms. In an experiment comparing our system against the TF-IDF keyword-based system of Zanibbi and Yuan, we demonstrate that, in many cases, the quality of the search results returned by the two systems is comparable (overall means, substitution tree vs. keyword-based: 100% vs. 89% for top 1; 48% vs. 51% for top 5; 22% vs. 28% for top 20). Overall, we present a promising first attempt at layout-based substitution tree indexing and retrieval for mathematical expressions, and we believe that this method will prove beneficial to the field of mathematical information retrieval.
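
To make the retrieval idea concrete, the following is a minimal sketch of substitution-based matching, the operation that a substitution tree organizes hierarchically. It is not the implementation described in this paper. The nested-tuple encoding of layout (`"sup"` for superscript, `"sub"` for subscript, `"row"` for a horizontal sequence), the `?`-prefixed variable names, and the functions `match` and `find` are all assumptions made for illustration. A flat list stands in for the tree, so the sketch shows what counts as a match, not how the index avoids scanning every entry.

```python
# Toy substitution matching over layout terms (illustrative assumptions only).
# A term is a symbol string such as "x", a variable string such as "?a",
# or a tuple (layout_operator, child, child, ...).

def is_var(t):
    """Variables are strings beginning with '?' (a convention of this sketch)."""
    return isinstance(t, str) and t.startswith("?")

def match(pattern, term, subst=None):
    """Return a dict binding pattern variables to subterms of term, or None."""
    subst = dict(subst or {})
    if is_var(pattern):
        if pattern in subst:
            # A repeated variable must bind to the same subterm everywhere.
            return subst if subst[pattern] == term else None
        subst[pattern] = term
        return subst
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        # Layout operator and number of children must agree.
        if len(pattern) != len(term) or pattern[0] != term[0]:
            return None
        for p, t in zip(pattern[1:], term[1:]):
            subst = match(p, t, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == term else None

def find(query, index):
    """Linear scan standing in for tree traversal: return (expression, bindings) pairs."""
    hits = []
    for expr in index:
        bindings = match(query, expr)
        if bindings is not None:
            hits.append((expr, bindings))
    return hits

index = [
    ("sup", "x", "2"),                                # x^2
    ("sup", "y", "2"),                                # y^2
    ("sub", "x", "2"),                                # x_2
    ("row", ("sup", "x", "2"), "+", ("sup", "y", "2")),  # x^2 + y^2
]

# Query: "anything squared"; matches x^2 and y^2 but not x_2.
results = find(("sup", "?a", "2"), index)
```

In an actual substitution tree, expressions that share structure are stored under a common generalized node with variables, so a query is compared against each shared pattern once rather than against every indexed expression.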
