A Tree-Based Approach to Clustering XML Documents by Structure

We propose a novel methodology for clustering XML documents on the basis of their structural similarities. The idea is to equip each cluster with an XML cluster representative, i.e. an XML document subsuming the most typical structural specifics of a set of XML documents. Clustering is essentially accomplished by comparing cluster representatives, and updating the representatives as soon as new clusters are detected. We present an algorithm for the computation of an XML representative based on suitable techniques for identifying significant node matchings and for reliably merging and pruning XML trees. Experimental evaluation performed on both synthetic and real data shows the effectiveness of our approach.
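
The representative-driven clustering loop described above can be illustrated with a minimal sketch. It is not the algorithm of the paper: the tree matching, merging, and pruning techniques are replaced here by simplifying assumptions. Each document is reduced to its set of root-to-node tag paths, similarity is the Jaccard coefficient of two path sets, and a representative keeps the paths occurring in at least a `min_support` fraction of the cluster members. The names `tag_paths`, `jaccard`, `representative`, `cluster_by_structure`, `min_support`, and `threshold` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of an XML document.

    Assumption: a path set is a crude stand-in for the XML tree itself.
    """
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths


def jaccard(a, b):
    """Jaccard similarity of two path sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def representative(path_sets, min_support=0.5):
    """Merge member path sets, then prune paths with low support."""
    counts = Counter(p for paths in path_sets for p in paths)
    n = len(path_sets)
    return {p for p, c in counts.items() if c / n >= min_support}


def cluster_by_structure(docs, threshold=0.5, min_support=0.5):
    """Agglomerative clustering that compares cluster representatives only."""
    path_sets = [tag_paths(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # document indices per cluster
    reps = [set(p) for p in path_sets]          # one representative per cluster

    while len(clusters) > 1:
        # Find the pair of clusters whose representatives are most similar.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = jaccard(reps[i], reps[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < threshold:
            break
        # Merge the pair and recompute the representative of the new cluster.
        clusters[i] = clusters[i] + clusters[j]
        reps[i] = representative([path_sets[k] for k in clusters[i]], min_support)
        del clusters[j], reps[j]  # j > i, so index i is unaffected

    return clusters, reps


docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<person><name/></person>",
    "<person><name/><email/></person>",
]
clusters, reps = cluster_by_structure(docs)
print(clusters)
```

In this sketch the representative is recomputed immediately after every merge, so later comparisons involve the merged structure rather than individual documents. Raising `min_support` prunes more aggressively and yields smaller representatives.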
