A methodology for clustering XML documents by structure

The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received strong attention. These operations involve, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Our approach is tested using a prototype testbed.

[1]  Jennifer Widom,et al.  The TSIMMIS Approach to Mediation: Data Models and Languages , 1997, Journal of Intelligent Information Systems.

[2]  Eugene W. Myers,et al.  AnO(ND) difference algorithm and its variations , 1986, Algorithmica.

[3]  Shane S. Sturrock,et al.  Time Warps, String Edits, and Macromolecules – The Theory and Practice of Sequence Comparison . David Sankoff and Joseph Kruskal. ISBN 1-57586-217-4. Price £13.95 (US$22·95). , 2000 .

[4]  Wesley W. Chu,et al.  Configurable indexing and ranking for XML information retrieval , 2004, SIGIR '04.

[5]  Jennifer Widom,et al.  Object exchange across heterogeneous information sources , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[6]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[7]  Horst Bunke,et al.  A graph distance metric based on the maximal common subgraph , 1998, Pattern Recognit. Lett..

[8]  André Hardy,et al.  An examination of procedures for determining the number of clusters in a data set , 1994 .

[9]  C. J. van Rijsbergen,et al.  The use of hierarchic clustering in information retrieval , 1971, Inf. Storage Retr..

[10]  SagivYehoshua,et al.  The TSIMMIS Approach to Mediation , 1997 .

[11]  Michael J. Fischer,et al.  The String-to-String Correction Problem , 1974, JACM.

[12]  Norbert Fuhr,et al.  XIRQL: a query language for information retrieval in XML documents , 2001, SIGIR '01.

[13]  Ricardo Baeza-Yates,et al.  Information Retrieval: Data Structures and Algorithms , 1992 .

[14]  Stanley M. Selkow,et al.  The Tree-to-Tree Editing Problem , 1977, Inf. Process. Lett..

[15]  Kyuseok Shim,et al.  XTRACT: a system for extracting document type descriptors from XML documents , 2000, SIGMOD '00.

[16]  Sudarshan S. Chawathe,et al.  Comparing Hierarchical Data in External Memory , 1999, VLDB.

[17]  Vincent Kanade,et al.  Clustering Algorithms , 2021, Wireless RF Energy Transfer in the Massive IoT Era.

[18]  Michalis Vazirgiannis,et al.  Clustering algorithms and validity measures , 2001, Proceedings Thirteenth International Conference on Scientific and Statistical Database Management. SSDBM 2001.

[19]  Alexander Dekhtyar,et al.  Information Retrieval , 2018, Lecture Notes in Computer Science.

[20]  David J. DeWitt,et al.  X-Diff: an effective change detection algorithm for XML documents , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[21]  Chun Zhang,et al.  Storing and querying ordered XML using a relational database system , 2002, SIGMOD '02.

[22]  E. Voorhees The Effectiveness & Efficiency of Agglomerative Hierarchic Clustering in Document Retrieval , 1985 .

[23]  J W Ballard,et al.  Data on the web? , 1995, Science.

[24]  Yiming Yang,et al.  Expert network: effective and efficient learning from human decisions in text categorization and retrieval , 1994, SIGIR '94.

[25]  David D. Lewis,et al.  Evaluating Text Categorization I , 1991, HLT.

[26]  H. V. Jagadish,et al.  Evaluating Structural Similarity in XML Documents , 2002, WebDB.

[27]  L. Hubert,et al.  A general statistical framework for assessing categorical clustering in free recall. , 1976 .

[28]  Siu-Ming Yiu,et al.  An efficient and scalable algorithm for clustering XML documents by structure , 2004, IEEE Transactions on Knowledge and Data Engineering.

[29]  Serge Abiteboul,et al.  Querying Semi-Structured Data , 1997, Encyclopedia of Database Systems.

[30]  Kyuseok Shim,et al.  XTRACT: Learning Document Type Descriptors from XML Document Collections , 2004, Data Mining and Knowledge Discovery.

[31]  Roy Goldman,et al.  DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases , 1997, VLDB.

[32]  Vijay V. Raghavan,et al.  Bitmap Indexing-based Clustering and Retrieval of XML Documents , 2001 .

[33]  J. Gower,et al.  Minimum Spanning Trees and Single Linkage Cluster Analysis , 1969 .

[34]  Neoklis Polyzotis,et al.  Statistical synopses for graph-structured XML databases , 2002, SIGMOD '02.

[35]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[36]  David Carmel,et al.  Searching XML documents via XML fragments , 2003, SIGIR.

[37]  Maureen Caudill Expert networks , 1990 .

[38]  Nicholas Rhodes Review of XML data management: Native XML and XML-Enabled database systems by Akmal B. Chaudhri, Awais Rashid, and Roberto Zicari, Boston, MA: Addison-Wesley 2003 , 2004 .

[39]  Frank Wm. Tompa,et al.  Specifying Transformations for Structured Documents , 2001, WebDB.

[40]  Jennifer Widom,et al.  Change detection in hierarchically structured information , 1996, SIGMOD '96.

[41]  George Karypis,et al.  Evaluation of hierarchical clustering algorithms for document datasets , 2002, CIKM '02.

[42]  Kaizhong Zhang,et al.  Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems , 1989, SIAM J. Comput..

[43]  Serge Abiteboul,et al.  Detecting changes in XML documents , 2002, Proceedings 18th International Conference on Data Engineering.

[44]  Marti A. Hearst,et al.  Reexamining the cluster hypothesis: scatter/gather on retrieval results , 1996, SIGIR '96.

[45]  Kuo-Chung Tai,et al.  The Tree-to-Tree Correction Problem , 1979, JACM.

[46]  Elio Masciari,et al.  Detecting Structural Similarities between XML Documents , 2002, WebDB.

[47]  David J. DeWitt,et al.  Relational Databases for Querying XML Documents: Limitations and Opportunities , 1999, VLDB.