An efficient and scalable algorithm for clustering XML documents by structure

With the standardization of XML as an information exchange language over the Internet, a huge amount of information is now formatted as XML documents. To analyze this information efficiently, a popular practice is to decompose the XML documents and store them in relational tables. However, query processing then becomes expensive because, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, documents that conform to different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. We introduce the notion of a structure graph (s-graph), which supports a computationally efficient distance metric defined between documents and between sets of documents. Built on this simple metric, our clustering algorithm is both efficient and effective compared with other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.
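The abstract does not spell out how an s-graph or its distance is computed, so the following Python sketch is only an illustration under stated assumptions. It assumes that an s-graph can be represented as the set of distinct parent-to-child element-name edges in a document, that the s-graph of a set of documents is the union of its members' edges, and that distance is one minus the shared-edge count divided by the larger edge count. The function names `s_graph`, `merged_s_graph`, and `s_graph_distance` are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET


def s_graph(xml_text):
    """Return the set of (parent_tag, child_tag) edges found in one document.

    Assumption: repeated edges and element order are ignored, so the result
    summarizes structure only, not content or multiplicity.
    """
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return edges


def merged_s_graph(xml_texts):
    """Assumed s-graph of a set of documents: the union of their edge sets."""
    edges = set()
    for text in xml_texts:
        edges |= s_graph(text)
    return edges


def s_graph_distance(g1, g2):
    """Assumed distance: 1 - |shared edges| / max(|g1|, |g2|)."""
    larger = max(len(g1), len(g2))
    if larger == 0:
        return 0.0  # two edge-free documents are treated as identical
    return 1.0 - len(g1 & g2) / larger


if __name__ == "__main__":
    book = "<book><title/><author><name/></author></book>"
    article = "<article><title/><author><name/></author></article>"
    print(s_graph_distance(s_graph(book), s_graph(article)))
```

Under these assumptions the comparison reduces to set operations on edge sets, which is much cheaper than computing a tree-edit distance between two document trees. That is the kind of efficiency the abstract attributes to the s-graph metric.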
