A Fuzzy Similarity Measure for XML Documents

The growth of the Internet has driven rapid adoption of XML for data representation and exchange over the Web. Measuring the similarity of XML documents is therefore an important research task for managing and retrieving information effectively on the Web. In this paper, we propose a new approach that determines the similarity of XML documents by considering both their content and their structure. The similarity is computed using the Sørensen–Dice coefficient and fuzzy intersection. We experimentally demonstrate the accuracy of the similarity method using real data sets.

Keywords: XML document similarity, fuzzy set, string matching.
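
The abstract does not spell out the exact formulation, so the following is only a minimal sketch of how a Sørensen–Dice coefficient can be combined with a fuzzy intersection. It assumes each document is represented as a fuzzy set mapping features (here, hypothetical element paths) to membership degrees in [0, 1], uses the standard min operator as the fuzzy intersection, and takes the sum of memberships as the set cardinality. The function names and the feature representation are illustrative assumptions, not the paper's method.

```python
def fuzzy_intersection(a, b):
    """Min-based (Zadeh) intersection of two fuzzy sets given as {feature: membership}."""
    return {k: min(a[k], b[k]) for k in a.keys() & b.keys()}


def cardinality(s):
    """Sigma-count cardinality: the sum of all membership degrees."""
    return sum(s.values())


def fuzzy_dice(a, b):
    """Sorensen-Dice coefficient 2|A n B| / (|A| + |B|) over fuzzy sets."""
    total = cardinality(a) + cardinality(b)
    if total == 0:
        # Assumed convention: two empty fuzzy sets are treated as identical.
        return 1.0
    return 2 * cardinality(fuzzy_intersection(a, b)) / total


# Hypothetical feature sets for two small XML documents.
doc_a = {"book/title": 1.0, "book/author": 0.5}
doc_b = {"book/title": 0.5, "book/price": 1.0}

similarity = fuzzy_dice(doc_a, doc_b)  # about 0.33 for these inputs
```

With this choice of operators the score is 1 for identical fuzzy sets and 0 for sets with no shared features; other t-norms could be substituted for `min` without changing the overall shape of the computation.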
