Hierarchical clustering of XML documents focused on structural components

Clustering XML documents by structure is the task of grouping them by common structural components. So far, this has been accomplished by looking at the occurrence of a single, preestablished type of structural component in the structures of the XML documents. However, structural components chosen a priori may not be the most appropriate for effective clustering. Moreover, the resulting clusters are likely to exhibit some degree of internal structural inhomogeneity, because differences in the structures of the XML documents that stem from other, neglected forms of structural components go undetected. To overcome these limitations, a new hierarchical approach is proposed that considers, if necessary, multiple forms of structural components to isolate structurally homogeneous clusters of XML documents. At each level of the resulting hierarchy, clusters are divided by considering some type of structural component, unaddressed at the preceding levels, that still differentiates the structures of the XML documents. Each cluster in the hierarchy is summarized through a novel technique that provides a clear and differentiated understanding of its structural properties. A comparative evaluation over both real and synthetic XML data shows that the devised approach outperforms established competitors in effectiveness and scalability. The cluster summaries are also shown to be highly representative.
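
The level-by-level idea can be illustrated with a small Python sketch. It is not the algorithm evaluated in the paper: the three component types (tags, parent-child edges, root-to-leaf paths), their order, and the grouping rule (documents fall in the same cluster only if their component sets are identical) are all assumptions chosen for brevity, and names such as `LEVELS` and `build` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def tags(root):
    # Set of element names occurring in the document.
    return frozenset(el.tag for el in root.iter())

def edges(root):
    # Set of (parent tag, child tag) pairs.
    return frozenset((p.tag, c.tag) for p in root.iter() for c in p)

def paths(root):
    # Set of root-to-leaf tag paths.
    found = set()
    def walk(el, prefix):
        path = prefix + (el.tag,)
        children = list(el)
        if not children:
            found.add(path)
        for child in children:
            walk(child, path)
    walk(root, ())
    return frozenset(found)

# Assumed order of component types, one per hierarchy level.
LEVELS = [("tags", tags), ("edges", edges), ("paths", paths)]

def build(docs, level=0):
    """Recursively split `docs` (name -> root element) into a hierarchy.

    A cluster is divided only when the next component type still
    differentiates its members; otherwise that type is skipped.
    """
    node = {"members": sorted(docs), "split_on": None, "children": []}
    while level < len(LEVELS) and len(docs) > 1:
        name, extract = LEVELS[level]
        groups = defaultdict(dict)
        for doc_name, root in docs.items():
            groups[extract(root)][doc_name] = root
        level += 1
        if len(groups) > 1:
            node["split_on"] = name
            node["children"] = [build(g, level) for g in groups.values()]
            break
    return node

def leaves(node):
    # Member lists of the leaf clusters, left to right.
    if not node["children"]:
        return [node["members"]]
    return [m for child in node["children"] for m in leaves(child)]

# Toy corpus (invented for illustration).
docs = {
    "d1": ET.fromstring("<a><b/><c/></a>"),
    "d2": ET.fromstring("<a><b><c/></b></a>"),
    "d3": ET.fromstring("<a><c/><b/></a>"),
    "d4": ET.fromstring("<x><y/></x>"),
}
tree = build(docs)
print(tree["split_on"], leaves(tree))
```

In this toy corpus, tags separate `d4` from the rest, and edges then separate `d2` from `d1` and `d3`, which paths cannot distinguish further. A realistic implementation would replace exact set equality with a proper clustering criterion over the chosen components.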
