Keyword and image-based retrieval of mathematical expressions

Two new methods for retrieving mathematical expressions are presented: one uses conventional keyword search, the other uses images of expressions. For keyword search, an expression-level TF-IDF (term frequency-inverse document frequency) approach is used, in which queries and indexed expressions are represented by keywords taken from LaTeX strings. TF-IDF is computed at the level of individual expressions rather than documents to increase the precision of matching. The second retrieval technique is a form of Content-Based Image Retrieval (CBIR). Expressions are segmented into connected components, and the components in the query expression are then matched against those of each expression in the collection using contour and density features, aspect ratios, and relative positions. In an experiment using ten randomly sampled queries from a corpus of over 22,000 expressions, precision-at-k (k = 20) was higher for the keyword-based approach (keyword: μ = 84.0, σ = 19.0; image-based: μ = 32.0, σ = 30.7), but for a few of the queries better results were obtained using a combination of the two techniques.
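As an illustration of the keyword-based approach, the following is a minimal sketch of expression-level TF-IDF ranking together with the precision-at-k measure. It is not the authors' implementation: the tokenization rule, the smoothed IDF variant, and the use of cosine similarity for scoring are all assumptions made for the example, as are the function names.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: LaTeX commands (e.g. \frac), letter runs, digit runs,
# and single non-brace symbols each become one keyword.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|[A-Za-z]+|\d+|[^\sA-Za-z\d{}]")


def tokenize(latex):
    """Split a LaTeX string into keywords."""
    return TOKEN_RE.findall(latex)


def normalize(vec):
    """Scale a sparse vector to unit length (empty if it has zero norm)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(expressions):
    """Treat every expression as its own 'document' when computing IDF."""
    counts = [Counter(tokenize(e)) for e in expressions]
    n = len(counts)
    df = Counter(t for c in counts for t in c)  # expression frequency per keyword
    # Smoothed IDF (an assumption; the abstract does not give the formula).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [normalize({t: tf * idf[t] for t, tf in c.items()}) for c in counts]
    return idf, vectors


def search(query, idf, vectors, k=20):
    """Return indices of the top-k expressions by cosine similarity."""
    q_counts = Counter(tokenize(query))
    q = normalize({t: tf * idf[t] for t, tf in q_counts.items() if t in idf})
    scored = [(sum(w * v.get(t, 0.0) for t, w in q.items()), i)
              for i, v in enumerate(vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scored[:k] if score > 0]


def precision_at_k(ranked, relevant, k=20):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for i in ranked[:k] if i in relevant) / k


if __name__ == "__main__":
    corpus = [r"\frac{x^2}{y}", r"x^2 + y^2 = z^2", r"\sum_{i=1}^n i"]
    idf, vectors = build_index(corpus)
    print(search(r"\frac{x}{y}", idf, vectors, k=2))
```

Because document frequency is counted over expressions rather than over the documents that contain them, a keyword such as `\frac` that appears in few expressions receives a high weight even if it occurs somewhere in most documents.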
