Paper Information - Approximate Matching Between XML Documents and Schemas with Applications in XML Classification and Clustering

Classifying and clustering XML documents by their structural information is important for many document-management tasks. In this chapter, we present a suite of algorithms that compute the cost of approximately matching an XML document against a schema. We then present a framework for classifying/clustering XML documents by structure, based on the distances computed between XML documents and schemas. The backbone of the framework is a feature representation in which each document is described by a vector of these distances. Experimental studies on various XML data sets suggest that the approach is an efficient and effective solution for structural classification/clustering of XML documents.
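The distance-vector representation described above can be illustrated with a minimal sketch. It is not the chapter's matching algorithm: `toy_distance` is a hypothetical stand-in that only counts parent-child tag pairs a schema does not allow, and the schema dictionaries, `SCHEMAS`, and `DOC` are invented for illustration. Assigning a document to its nearest schema is likewise only one simple way such a vector could be used.

```python
import xml.etree.ElementTree as ET


def edges(root):
    """Return every (parent_tag, child_tag) pair in the document tree."""
    return [(parent.tag, child.tag) for parent in root.iter() for child in parent]


def toy_distance(xml_text, schema):
    """Stand-in cost (an assumption, not the chapter's edit-based cost):
    1 for a wrong root tag, plus 1 per parent-child pair the schema disallows.
    `schema` is a hypothetical dict: {"root": str, "edges": set of tag pairs}."""
    root = ET.fromstring(xml_text)
    cost = 0 if root.tag == schema["root"] else 1
    cost += sum(1 for edge in edges(root) if edge not in schema["edges"])
    return cost


def distance_vector(xml_text, schemas):
    """Represent a document as the vector of its distances to each schema."""
    return [toy_distance(xml_text, schema) for schema in schemas]


def classify(xml_text, schemas):
    """Assign the document to the index of the closest schema."""
    vector = distance_vector(xml_text, schemas)
    return vector.index(min(vector))


# Two illustrative "schemas" given as allowed parent-child tag pairs.
SCHEMAS = [
    {"root": "book",
     "edges": {("book", "title"), ("book", "chapter"), ("chapter", "para")}},
    {"root": "article",
     "edges": {("article", "title"), ("article", "section"), ("section", "para")}},
]

DOC = "<book><title/><chapter><para/><para/></chapter><note/></book>"

print(distance_vector(DOC, SCHEMAS))  # [1, 6]
print(classify(DOC, SCHEMAS))         # 0
```

Here the document is one unexpected `note` element away from the first schema and mismatches the second on every edge, so its feature vector separates the two classes clearly. In a fuller setting the same vectors could be passed to any standard classifier or clustering method.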

Guangming Xing

[1] J. Clark, et al. RELAX NG specification, 2001.

[2] Richi Nayak, et al. XML schema clustering with semantic and hierarchical similarity measures, 2007, Knowl. Based Syst.

[3] Horst Bunke, et al. Classes of cost functions for string edit distance, 2006, Algorithmica.

[4] Timos K. Sellis, et al. A methodology for clustering XML documents by structure, 2006, Inf. Syst.

[5] Eugene W. Myers, et al. Approximately Matching Context-Free Languages, 1995, Inf. Process. Lett.

[6] Sriram Padmanabhan, et al. A framework for the selective dissemination of XML documents based on inferred user profiles, 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[7] Elisa Bertino, et al. A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications, 2004, Inf. Syst.

[8] Kaizhong Zhang. A New Editing based Distance between Unordered Labeled Trees, 1993, CPM.

[9] Stefano Spaccapietra, et al. Issues and approaches of database integration, 1998, CACM.

[10] Murali Mani, et al. Taxonomy of XML schema languages using formal language theory, 2005, TOIT.

[11] Richi Nayak, et al. XCLS: A Fast and Effective Clustering Algorithm for Heterogenous XML Documents, 2006, PAKDD.

[12] Gabriel Valiente, et al. An efficient bottom-up distance between trees, 2001, Proceedings Eighth Symposium on String Processing and Information Retrieval.

[13] Eiichi Tanaka, et al. The Tree-to-Tree Editing Problem, 1988, Int. J. Pattern Recognit. Artif. Intell.

[14] AnHai Doan, et al. Matching Schemas in Online Communities: A Web 2.0 Approach, 2008, 2008 IEEE 24th International Conference on Data Engineering.

[15] Kuo-Chung Tai, et al. The Tree-to-Tree Correction Problem, 1979, JACM.

[16] Kaizhong Zhang, et al. On the Editing Distance Between Unordered Labeled Trees, 1992, Inf. Process. Lett.

[17] E. Rodney Canfield, et al. Approximate matching of XML document with regular hedge grammar, 2005, Int. J. Comput. Math.

[18] Kaizhong Zhang, et al. Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems, 1989, SIAM J. Comput.

[19] Nobutaka Suzuki. Finding an optimum edit script between an XML document and a DTD, 2005, SAC '05.

[20] Aristides Gionis, et al. XTRACT: a system for extracting document type descriptors from XML documents, 2000, SIGMOD 2000.

[21] Kaizhong Zhang, et al. Fast Algorithms for the Unit Cost Editing Distance Between Trees, 1990, J. Algorithms.

[22] H. V. Jagadish, et al. Evaluating Structural Similarity in XML Documents, 2002, WebDB.

[23] Zhonghang Xia, et al. Clustering XML Documents Based on Structural Similarity, 2007, DASFAA.

[24] Guangming Xing. Fast Approximate Matching Between XML Documents and Schemata, 2006, APWeb.

[25] Sergio Greco, et al. Semantic clustering of XML documents, 2010, TOIS.

[26] Ioana Manolescu, et al. XMark: A Benchmark for XML Data Management, 2002, VLDB.

[27] Cong Yu, et al. TIMBER: A native XML database, 2002, The VLDB Journal.

[28] Elisa Bertino, et al. Measuring the structural similarity among XML documents and DTDs, 2006, Journal of Intelligent Information Systems.

[29] Ludovic Denoyer, et al. XML Structure Mapping, 2006, INEX.

[30] Amit Kumar, et al. XML stream processing using tree-edit distance embeddings, 2005, TODS.

[31] Anna Formica, et al. Similarity of XML-Schema Elements: A Structural and Information Content Approach, 2008, Comput. J.

[32] Stanley M. Selkow, et al. The Tree-to-Tree Editing Problem, 1977, Inf. Process. Lett.

[33] Shin-Yee Lu. A Tree-to-Tree Distance and Its Application to Cluster Analysis, 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34] Torsten Schlieder. Similarity Search in XML Data using Cost-Based Query Transformations, 2001, WebDB.

[35] Eric van der Vlist, et al. XML Schema, 2002.

[36] Ludovic Denoyer, et al. Report on the XML Mining Track at INEX 2005 and INEX 2006, 2006, INEX.

[37] Serge Abiteboul, et al. Querying Semi-Structured Data, 1997, Encyclopedia of Database Systems.

[38] Elisa Bertino, et al. Protection and administration of XML data sources, 2002, Data Knowl. Eng.

[39] Richi Nayak, et al. Evaluating the Performance of XML Document Clustering by Structure Only, 2006, INEX.

[40] Tao Jiang, et al. Alignment of Trees - An Alternative to Tree Edit, 1994, CPM.

[41] Zhonghang Xia, et al. Classifying XML Documents Based on Structure/Content Similarity, 2006, INEX.

[42] Alberto H. F. Laender, et al. Automatic web news extraction using tree edit distance, 2004, WWW '04.