Approximate Matching Between XML Documents and Schemas with Applications in XML Classification and Clustering

Classifying and clustering XML documents by their structural information is important for many document management tasks. In this chapter, we present a suite of algorithms that compute the cost of approximate matching between XML documents and schemas. We then present a framework for classifying and clustering XML documents by structure, based on the distances computed between XML documents and schemas. The backbone of the framework is a feature representation in which each document is described by a vector of its distances to the schemas. Experimental studies on various XML data sets suggest that the approach is an efficient and effective solution for structural classification and clustering of XML documents.
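
The distance-vector representation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: each "schema" is reduced to a hypothetical set of allowed parent-child tag pairs, and `toy_distance` simply counts the document edges a schema does not allow. It is a stand-in for the document-to-schema edit distance that the chapter's algorithms compute, not an implementation of it.

```python
import xml.etree.ElementTree as ET


def edges(xml_text):
    """Return every (parent_tag, child_tag) pair found in the document."""
    root = ET.fromstring(xml_text)
    return [(parent.tag, child.tag) for parent in root.iter() for child in parent]


def toy_distance(xml_text, allowed_edges):
    """Toy stand-in for a document-to-schema distance (hypothetical):
    the number of parent-child edges the schema does not allow."""
    return sum(1 for edge in edges(xml_text) if edge not in allowed_edges)


def distance_vector(xml_text, schemas):
    """Feature vector: one distance per schema, in schema order."""
    return [toy_distance(xml_text, schema) for schema in schemas]


def nearest_schema(vector):
    """Index of the schema with the smallest distance."""
    return min(range(len(vector)), key=vector.__getitem__)


# Hypothetical schemas, each reduced to its allowed parent-child pairs.
SCHEMAS = [
    {("book", "title"), ("book", "author")},
    {("article", "title"), ("article", "journal")},
]

doc = "<book><title/><author/><journal/></book>"
vec = distance_vector(doc, SCHEMAS)
print(vec, nearest_schema(vec))
```

In a setting like the one described, such vectors could be handed to any standard classifier or clustering method; picking the nearest schema is only the simplest possible use of them.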
