Structure-Oriented Techniques for XML Document Partitioning

Clustering XML documents on only one type of structural component may produce clusters with some degree of internal structural inhomogeneity, caused either by undetected differences in the overall logical structures of the available XML documents or by an inappropriate choice of the targeted structural component. To overcome these limitations, two approaches to clustering XML documents by multiple heterogeneous structures are proposed. The first approach considers the simultaneous occurrence of such structures across the individual XML documents. The second approach instead combines multiple clusterings of the XML documents, each performed separately with respect to one type of structure in isolation. A comparative evaluation on both real and synthetic XML data shows that the effectiveness of the devised approaches is at least on a par with, and in some cases superior to, that of state-of-the-art competitors. The empirical evidence also reveals that the proposed approaches outperform these competitors in terms of time efficiency.
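The two strategies can be illustrated with a minimal sketch. It is not the authors' method, which the abstract does not specify. It assumes two structure types (root-to-leaf tag paths and parent-child tag edges), Jaccard similarity with a fixed threshold as a stand-in base clusterer, and unanimous co-association as the combination rule. All function names, the threshold, and the toy documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def root_to_leaf_paths(root):
    """Structure type 1 (assumed): set of root-to-leaf tag paths."""
    paths, stack = set(), [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            paths.add(path)
        for child in children:
            stack.append((child, path + "/" + child.tag))
    return paths


def parent_child_edges(root):
    """Structure type 2 (assumed): set of (parent tag, child tag) pairs."""
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def components(n, linked):
    """Label items 0..n-1 by connected component of the 'linked' relation."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if linked(i, j):
            parent[find(j)] = find(i)
    return [find(i) for i in range(n)]


def cluster(feature_sets, threshold=0.5):
    """Stand-in base clusterer: link documents whose Jaccard similarity meets the threshold."""
    return components(
        len(feature_sets),
        lambda i, j: jaccard(feature_sets[i], feature_sets[j]) >= threshold,
    )


def combine(labelings):
    """Consensus by co-association: link two documents only if every base clustering groups them."""
    return components(
        len(labelings[0]),
        lambda i, j: all(labels[i] == labels[j] for labels in labelings),
    )


# Toy corpus (hypothetical).
docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><author/></book>",
    "<article><title/><body><sec/></body></article>",
    "<article><title/><body><sec/><sec/></body></article>",
]
roots = [ET.fromstring(d) for d in docs]
paths = [root_to_leaf_paths(r) for r in roots]
edges = [parent_child_edges(r) for r in roots]

# Strategy 1: treat the co-occurring structures of each document as one feature set.
joint = [{("path", p) for p in ps} | {("edge", e) for e in es}
         for ps, es in zip(paths, edges)]
joint_labels = cluster(joint)

# Strategy 2: cluster per structure type in isolation, then combine the clusterings.
consensus_labels = combine([cluster(paths), cluster(edges)])
```

On this toy corpus both strategies separate the two `book` documents from the two `article` documents. On real data they can disagree, since the first strategy lets one structure type compensate for another within a single similarity score, while the second requires agreement across separately computed clusterings.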
