A Study on XML Path Similarity

The data model of an XML document can be represented as a tag tree of element nodes. Such a tree can in turn be represented by the set of paths from the root node to the leaf nodes, which describes the structure of the XML document. This paper presents an approach for measuring the similarity between two XML paths that consists of (1) ElementSim, a similarity function designed for measuring the linguistic similarity between two elements in two different paths, which takes into account both the semantic and the syntactic information of the elements; and (2) NPathSim, a similarity function designed for measuring the similarity between two paths, which combines the linguistic similarity between elements with the context descriptions of the paths. Path retrieval was performed to evaluate the quality of NPathSim. The experiments show that the proposed similarity approach achieves higher quality on an XML data set.
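
To make the path-set representation concrete, the following minimal sketch enumerates the root-to-leaf tag paths of a small XML document using Python's standard xml.etree.ElementTree module. The function name root_to_leaf_paths and the sample document are illustrative assumptions and are not taken from the paper; the sketch covers only the path-extraction step and does not implement ElementSim or NPathSim.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text):
    """Return the set of root-to-leaf tag paths of an XML document.

    Each path is a tuple of element tag names, starting at the root.
    (Hypothetical helper for illustration; not the paper's code.)
    """
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        current = prefix + (node.tag,)
        children = list(node)
        if not children:
            # A node without child elements is a leaf: record its path.
            paths.add(current)
        for child in children:
            walk(child, current)

    walk(root, ())
    return paths


# Illustrative sample document (an assumption, not from the paper's data set).
SAMPLE = (
    "<book>"
    "<title>XML</title>"
    "<author><name>A</name><email>a@example.org</email></author>"
    "</book>"
)

if __name__ == "__main__":
    for path in sorted(root_to_leaf_paths(SAMPLE)):
        print("/".join(path))
```

Because the result is a set, repeated sibling structures collapse into a single path, which matches the view of the path set as a description of document structure rather than of document content. A path similarity function such as NPathSim would then compare pairs of these tag sequences.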