Using structural similarity for clustering XML documents

In this paper, we describe a method for clustering XML documents whose goal is to group documents that share similar structures. Our approach has two steps. First, we automatically extract the structure of each XML document to be clustered. Second, we use this extracted structure as the representation model for clustering the corresponding document. The idea behind the clustering is that XML documents sharing similar structures are more likely to match the structural part of the same query. We tested our algorithms on both real data (the ACM SIGMOD Record corpus) and synthetic data. The results demonstrate the interest of our approach.
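The two-step idea can be illustrated with a minimal sketch. It is not the algorithm of the paper: it assumes, purely for illustration, that the structure of a document is represented by its set of root-to-element tag paths, that similarity is the Jaccard coefficient of two such sets, and that clustering is a greedy single pass with a hypothetical `threshold` parameter. The function names `extract_structure`, `jaccard`, and `cluster` are likewise assumptions.

```python
import xml.etree.ElementTree as ET


def extract_structure(xml_text):
    """Step 1 (assumed representation): the set of root-to-element tag paths."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def jaccard(a, b):
    """Similarity of two path sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.5):
    """Step 2 (assumed strategy): greedy single-pass clustering.

    Each document joins the first cluster whose first member is at least
    `threshold` similar; otherwise it starts a new cluster.
    Returns clusters as lists of document indices.
    """
    structures = [extract_structure(d) for d in docs]
    clusters = []
    for i, s in enumerate(structures):
        for members in clusters:
            if jaccard(s, structures[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "<article><title>A</title><author>X</author></article>",
    "<article><title>B</title><author>Y</author><author>Z</author></article>",
    "<issue><volume>1</volume><number>2</number></issue>",
]
print(cluster(docs))  # [[0, 1], [2]]
```

The two `article` documents yield the same path set and are grouped together, while the `issue` document shares no path with them and forms its own cluster. A path-set representation ignores element order and repetition; a tree-based distance would capture those at a higher cost.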
