An Evaluation of Similarity Search Methods Blending Structures and Keywords in XML Documents

Over the past few years, hundreds of XML-based document formats have appeared; office documents are typical examples. At the same time, demand for document search has grown and become more complex, since users need not only keyword search but also similarity search. In our previous work, we proposed LAX+, an algorithm for measuring the similarity between XML trees. However, LAX+ performs rigid matching at the leaf nodes of XML trees. In this paper, we propose two methods, KLAX and LAX&KEY. To measure similarity between leaf nodes more precisely, KLAX improves LAX+ by checking the number of common keywords in the leaf nodes. LAX&KEY separately measures the similarity between XML trees with LAX+ and the similarity based on common keywords in the XML trees, and then combines the two values. In our experiments with docx, xlsx, and pptx files, the proposed methods yield better precision and recall.
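
To make the two ideas concrete, the following is a minimal Python sketch, not the implementation evaluated in this paper. It assumes that leaf text is split into lowercase word tokens, that keyword overlap is scored as a Jaccard ratio, and that the structural and keyword scores are blended with a weight `alpha`. The function names (`keywords`, `leaf_keyword_similarity`, `combine`), the tokenizer, and the weighting scheme are all hypothetical. The structural score itself is taken as a given input, because the details of LAX+ are not restated here.

```python
import re


def keywords(text):
    """Return the set of lowercase word tokens in a leaf node's text.

    Assumption: a simple regex tokenizer stands in for keyword extraction.
    """
    return set(re.findall(r"\w+", text.lower()))


def leaf_keyword_similarity(text_a, text_b):
    """Score two leaf texts in [0, 1] by their share of common keywords.

    Assumption: Jaccard overlap is used as one possible way to turn
    'the number of common keywords' into a normalized value.
    """
    kw_a, kw_b = keywords(text_a), keywords(text_b)
    if not kw_a and not kw_b:
        return 1.0  # illustrative choice: two empty leaves count as identical
    return len(kw_a & kw_b) / len(kw_a | kw_b)


def combine(structure_sim, keyword_sim, alpha=0.5):
    """Blend a structural score and a keyword score, both in [0, 1].

    Assumption: a linear weighting with a hypothetical parameter alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * structure_sim + (1.0 - alpha) * keyword_sim


if __name__ == "__main__":
    # Rigid matching would treat these two leaves as entirely different.
    leaf_sim = leaf_keyword_similarity("annual sales report",
                                       "annual sales summary")
    print(leaf_sim)
    # 0.8 is a made-up structural score used only for illustration.
    print(combine(0.8, leaf_sim))
```

In this sketch, `leaf_keyword_similarity` illustrates the kind of graded leaf comparison that KLAX aims for, while `combine` illustrates the separate-then-merge idea behind LAX&KEY.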