Toward Semantic XML Clustering

The increasing availability of heterogeneous XML informative sources has raised a number of issues concerning how to represent and manage semistructured data. Although XML sources can exhibit proper structures and contents, differently annotated XML documents may in principle encode related semantics due to subjective definitions of markup tags. Discovering knowledge to infer semantic organization of XML documents has become a major challenge in XML data management. In this context, we address the problem of clustering XML data according to structure as well as content features enriched with lexical ontology knowledge. We propose a framework for clustering semantically cohesive XML structures based on a transactional representation model. Experiments on large real datasets give evidence that the proposed approach is highly effective in detecting groups of XML data that exhibit structure and/or content affinities.

[1]  Charu C. Aggarwal,et al.  XRules: an effective structural classifier for XML data , 2003, KDD '03.

[2]  Michael E. Lesk,et al.  Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone , 1986, SIGDOC '86.

[3]  Dan Suciu,et al.  Data on the Web: From Relations to Semistructured Data and XML , 1999 .

[4]  Ioana Manolescu,et al.  The XML benchmark project , 2001 .

[5]  H. V. Jagadish,et al.  Evaluating Structural Similarity in XML Documents , 2002, WebDB.

[6]  Siu-Ming Yiu,et al.  An efficient and scalable algorithm for clustering XML documents by structure , 2004, IEEE Transactions on Knowledge and Data Engineering.

[7]  Gianni Costa,et al.  A Tree-Based Approach to Clustering XML Documents by Structure , 2004, PKDD.

[8]  Antoine Doucet,et al.  Naïve Clustering of a large XML Document Collection , 2002, INEX Workshop.

[9]  Jennifer Widom Data Management for XML: Research Directions , 1999, IEEE Data Eng. Bull..

[10]  Sergio Greco,et al.  Repairs and Consistent Answers for XML Data with Functional Dependencies , 2003, Xsym.

[11]  Marcelo Arenas,et al.  A normal form for XML documents , 2002, PODS '02.

[12]  Neoklis Polyzotis,et al.  Structure and Value Synopses for XML Data Graphs , 2002, VLDB.

[13]  Vijay V. Raghavan,et al.  BitCube: A Three-Dimensional Bitmap Indexing for XML Documents , 2004, Journal of Intelligent Information Systems.

[14]  Fosca Giannotti,et al.  Clustering Transactional Data , 2002, PKDD.

[15]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[16]  Ted Pedersen,et al.  Extended Gloss Overlaps as a Measure of Semantic Relatedness , 2003, IJCAI.

[17]  Philip Resnik,et al.  Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language , 1999, J. Artif. Intell. Res..

[18]  Gerhard Weikum,et al.  Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data , 2003, WebDB.