Integrating Content and Structure into a Comprehensive Framework for XML Document Similarity Represented in 3D Space

XML is attractive for data exchange between different platforms, and the number of XML documents is rapidly increasing. This has raised the need for techniques that measure the similarity between XML documents so that they can be classified and used in a better organized way. The idea of similarity between documents is not new. However, XML documents are richer and more informative than classical documents in that they encapsulate both structure and content, whereas classical documents are characterized only by their content. Accordingly, using both the content and the structure of XML documents to assign a similarity metric is relatively new. Most of the recent algorithms proposed in the literature assign a similarity value between 0.0 and 1.0 when comparing two XML documents. The similarity measures between multiple XML documents may be arranged in a matrix, to which data mining may be applied to cluster closely related documents. In this chapter, we present a novel way to represent XML document similarity in 3D space. Our approach exploits the characteristics of XML documents to produce a measure that can be used in clustering and classification techniques, information retrieval, and searching methods for XML documents. We derive a three-dimensional vector per document: two dimensions capture the document's structural and content characteristics, respectively, while the third dimension combines both. The outcome of our research allows users to intuitively visualize document similarity.
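
As a minimal illustration of the representation described above, the following Python sketch places each document at a point in 3D space and arranges pairwise similarities in a matrix. It is not the chapter's algorithm. It makes several assumptions that the abstract does not confirm: the structural and content scores are already available as numbers in [0.0, 1.0], the third dimension is a weighted mean controlled by a hypothetical parameter `alpha`, and similarity is derived from Euclidean distance. The function names, file names, and scores are illustrative placeholders rather than results.

```python
import math

def doc_vector(structure_score, content_score, alpha=0.5):
    """Build a 3D vector (structure, content, combined) for one document.

    Assumption: both scores lie in [0.0, 1.0], and the combined dimension
    is a weighted mean; the source does not specify the combination.
    """
    combined = alpha * structure_score + (1 - alpha) * content_score
    return (structure_score, content_score, combined)

def similarity(u, v):
    """Map Euclidean distance inside the unit cube to a value in [0.0, 1.0].

    sqrt(3) is the largest possible distance between two points whose
    coordinates all lie in [0.0, 1.0].
    """
    return 1.0 - math.dist(u, v) / math.sqrt(3)

def similarity_matrix(vectors):
    """Arrange pairwise similarities in a square matrix (list of lists)."""
    return [[similarity(u, v) for v in vectors] for u in vectors]

# Placeholder scores for three hypothetical documents.
docs = {
    "a.xml": doc_vector(0.90, 0.80),
    "b.xml": doc_vector(0.85, 0.75),
    "c.xml": doc_vector(0.20, 0.30),
}
matrix = similarity_matrix(list(docs.values()))

for name, row in zip(docs, matrix):
    print(name, [round(value, 3) for value in row])
```

Under these assumptions, documents with similar structural and content scores lie close together in the cube, so a clustering method applied to the matrix, or a 3D scatter plot of the vectors, would group them visually.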
