Feature extraction and clustering-based retrieval for mathematical formulas

Mathematical formulas are essential for presenting scientific knowledge in research documents in fields such as physics and mathematics. Searching for related formulas is an important but challenging problem because a formula carries both structural and semantic information, and that information is implicit in the expression itself. To support effective formula search, the structural and semantic features must be extracted faithfully from the mathematical presentation of each formula. In this paper, we propose an approach for formula feature extraction. To evaluate it, the extracted features are used with three popular clustering algorithms, namely K-means, Self-Organizing Map (SOM), and Agglomerative Hierarchical Clustering (AHC), for formula retrieval. The performance of the clustering-based retrieval is measured on a dataset of 881 formulas, and promising results have been achieved.
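
The pipeline summarized above (feature extraction, clustering, then retrieval within the matching cluster) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify the feature set, so `extract_features` uses a hypothetical scheme (normalized tokens as a stand-in for semantic features, adjacent token pairs as a stand-in for structural ones), the six formulas are toy data rather than the 881-formula dataset, and only K-means is shown, with a deterministic initialization chosen for reproducibility.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer for LaTeX-like strings: commands, letters, numbers, symbols.
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]|\d+|[^\sA-Za-z\d]")

def extract_features(formula):
    """Assumed feature scheme: normalized tokens plus adjacent token pairs."""
    norm = []
    for tok in TOKEN.findall(formula):
        if tok.isdigit():
            norm.append("NUM")          # any number
        elif len(tok) == 1 and tok.isalpha():
            norm.append("VAR")          # any single-letter variable
        else:
            norm.append(tok)            # operators and commands such as \sin
    feats = Counter(norm)
    feats.update(a + " " + b for a, b in zip(norm, norm[1:]))
    return feats

def vectorize(feats, vocab):
    """Project feature counts onto the vocabulary and scale to unit length."""
    vec = [float(feats.get(f, 0)) for f in vocab]
    length = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / length for v in vec]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain K-means; the first k vectors seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def retrieve(query, formulas, vocab, vectors, labels, centroids, top_n=3):
    """Pick the nearest cluster, then rank its members by cosine similarity."""
    q = vectorize(extract_features(query), vocab)
    cluster = min(range(len(centroids)), key=lambda c: sq_dist(q, centroids[c]))
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    members.sort(key=lambda i: -sum(a * b for a, b in zip(q, vectors[i])))
    return [formulas[i] for i in members[:top_n]]

# Toy corpus (illustrative only).
formulas = [
    r"\sin x + \cos x",
    r"a^2 + b^2 = c^2",
    r"\sin y + \cos y",
    r"x^2 + y^2 = z^2",
    r"\cos x + \sin x",
    r"a^2 + b^2",
]
features = [extract_features(f) for f in formulas]
vocab = sorted(set().union(*features))
vectors = [vectorize(f, vocab) for f in features]
labels, centroids = kmeans(vectors, k=2)
results = retrieve(r"\sin t + \cos t", formulas, vocab, vectors, labels, centroids)
print(labels)
print(results)
```

Restricting the ranking step to a single cluster is what makes the retrieval clustering-based: the query is compared against k centroids first and only then against the members of one cluster, rather than against every formula in the collection.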
