Paper Information - Clustering XML Documents Based on Structural Similarity

Clustering XML Documents Based on Structural Similarity

In this paper, we present a framework for clustering XML documents based on their structural similarity. First, we show the validity of using the edit distance between XML documents and schemata as a measure of structural similarity. Second, we give a novel solution for schema extraction. The solution is based on the minimum description length (MDL) principle and allows a trade-off between schema simplicity and precision according to the user's specification. Third, we discuss clustering XML documents based on the edit distance. The efficacy and efficiency of our methodology have been tested using both real and synthesized data.

Zhonghang Xia | Guangming Xing | Jinhua Guo

[1] Z. Galil, et al. Pattern matching algorithms, 1997.

[2] Yanchun Zhang, et al. Frontiers of WWW Research and Development - APWeb 2006, 8th Asia-Pacific Web Conference, Harbin, China, January 16-18, 2006, Proceedings, 2006, APWeb.

[3] Boris Chidlovskii. Schema Extraction from XML: A Grammatical Inference Approach, 2001, KRDB.

[4] Alfred V. Aho, et al. The Design and Analysis of Computer Algorithms, 1974.

[5] Aristides Gionis, et al. XTRACT: a system for extracting document type descriptors from XML documents, 2000, SIGMOD 2000.

[6] Timos K. Sellis, et al. A methodology for clustering XML documents by structure, 2006, Inf. Syst.

[7] Makoto Murata, et al. Hedge automata: a formal model for XML schemata, 1999.

[8] Nobutaka Suzuki, et al. Finding an optimum edit script between an XML document and a DTD, 2005, SAC '05.

[9] Guangming Xing. Fast Approximate Matching Between XML Documents and Schemata, 2006, APWeb.

[10] Kaizhong Zhang, et al. Approximate tree pattern matching, 1997.

[11] H. V. Jagadish, et al. Evaluating Structural Similarity in XML Documents, 2002, WebDB.

[12] Ken Thompson, et al. Programming Techniques: Regular expression search algorithm, 1968, Commun. ACM.

[13] George Karypis, et al. CLUTO - A Clustering Toolkit, 2002.