On Effective XML Clustering by Path Commonality: An Efficient and Scalable Algorithm

XML clustering by structure is, in its most general form, the process of partitioning a corpus of XML documents into disjoint clusters, such that intra-cluster structural homogeneity is high and inter-cluster structural homogeneity is low. In this paper, we propose an algorithm that implements a partitioning strategy in which root-to-leaf paths are used to separate the XML documents. Because paths are discriminatory substructures, the algorithm is highly effective. Moreover, a suitable encoding is adopted for representing the individual paths and testing their occurrence within each XML document, at a cost that is independent of path length. This not only expedites clustering, but also makes the algorithm scalable to large corpora of XML documents. A comparative evaluation over several standard (real-world and synthetic) XML corpora reveals that our algorithm outperforms all of its competitors in efficiency and scalability, while being as effective as the best-performing competitors. An especially appealing property of the proposed algorithm is that it achieves these levels of performance while automatically determining a natural number of clusters in the underlying XML corpus, rather than requiring that number as input.
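To illustrate the idea of representing documents by their root-to-leaf paths, the following is a minimal sketch and not the encoding used in the paper, whose details the abstract does not give. It assumes that each distinct path can be mapped to an integer identifier, so that a document becomes a set of identifiers and testing whether a path occurs in a document is a set lookup that does not depend on the path's length. The function names `root_to_leaf_paths` and `encode_corpus` are hypothetical.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text):
    """Return the set of root-to-leaf tag paths of one XML document.

    Each path is a tuple of element tags, e.g. ("a", "b", "c").
    Text content and attributes are ignored (structure only).
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            # A node without child elements ends a root-to-leaf path.
            paths.add(path)
        for child in children:
            stack.append((child, path + (child.tag,)))
    return paths


def encode_corpus(docs):
    """Assumed encoding: one integer id per distinct path in the corpus.

    Returns (path_ids, encoded), where path_ids maps path -> id and
    encoded[i] is the frozenset of path ids occurring in docs[i].
    """
    path_ids = {}
    encoded = []
    for xml_text in docs:
        ids = frozenset(
            path_ids.setdefault(path, len(path_ids))
            for path in root_to_leaf_paths(xml_text)
        )
        encoded.append(ids)
    return path_ids, encoded


docs = [
    "<a><b><c/></b><d/></a>",
    "<a><b><c/></b></a>",
    "<x><y/></x>",
]
path_ids, encoded = encode_corpus(docs)

# Occurrence test: a single set lookup on the integer id.
abc = path_ids[("a", "b", "c")]
shares_abc = [abc in doc for doc in encoded]
```

In this toy corpus the first two documents share the path `a/b/c`, while the third shares no path with either, which is the kind of signal a path-based partitioning strategy could use to separate documents.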
