Paper information - Measuring the Structural Similarity of Semistructured Documents Using Entropy

Measuring the Structural Similarity of Semistructured Documents Using Entropy

We propose a technique for measuring the structural similarity of semistructured documents based on entropy. After extracting the structural information from two documents, we use either Ziv-Lempel encoding or Ziv-Merhav cross-parsing to determine the entropy and, consequently, the similarity between the documents. To the best of our knowledge, this is the first true linear-time approach for evaluating structural similarity. In an experimental evaluation, we demonstrate that the clustering quality achieved by our algorithm is on a par with, or even better than, that of existing approaches.
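The pipeline described in the abstract (extract structure, then parse one document against the other) can be illustrated with a small sketch. This is not the paper's implementation: the function names `tag_sequence`, `lz78_phrases`, and `cross_parse_phrases` are hypothetical, the structure extraction simply keeps the tag names in document order, and the cross-parsing uses a naive substring search rather than the linear-time machinery the abstract refers to. The underlying idea is that the fewer phrases are needed to describe one tag sequence in terms of another, the more structure the two documents share.

```python
import xml.etree.ElementTree as ET


def tag_sequence(xml_text):
    """Flatten a document into its opening/closing tag names, dropping all text content."""
    seq = []

    def walk(node):
        seq.append(node.tag)          # opening tag
        for child in node:
            walk(child)
        seq.append("/" + node.tag)    # closing tag

    walk(ET.fromstring(xml_text))
    return seq


def lz78_phrases(seq):
    """Number of phrases in an LZ78-style incremental parsing of seq (self-parsing)."""
    seen, phrase, count = set(), (), 0
    for sym in seq:
        phrase += (sym,)
        if phrase not in seen:        # shortest phrase not seen before ends here
            seen.add(phrase)
            count += 1
            phrase = ()
    return count + (1 if phrase else 0)


def _occurs(pattern, seq):
    """True if pattern appears as a contiguous run inside seq (naive search)."""
    k = len(pattern)
    return any(seq[j:j + k] == pattern for j in range(len(seq) - k + 1))


def cross_parse_phrases(z, x):
    """Ziv-Merhav-style cross-parsing: split z into the longest phrases found in x.

    Naive and quadratic or worse; an assumption-laden illustration only.
    """
    count, i = 0, 0
    while i < len(z):
        length = 0
        while i + length < len(z) and _occurs(z[i:i + length + 1], x):
            length += 1
        i += max(length, 1)           # an unseen symbol forms a phrase of its own
        count += 1
    return count


doc_a = "<a><b/><b/><c/></a>"
doc_b = "<a><b/><b/><b/><c/></a>"
doc_c = "<x><y><z/></y></x>"
seq_a, seq_b, seq_c = map(tag_sequence, (doc_a, doc_b, doc_c))

print(cross_parse_phrases(seq_b, seq_a))  # 2: doc_b is almost fully covered by doc_a
print(cross_parse_phrases(seq_c, seq_a))  # 6: every tag of doc_c is new relative to doc_a
```

In this toy example, `doc_b` differs from `doc_a` only by one repeated element and is parsed in two phrases, whereas the structurally unrelated `doc_c` needs one phrase per tag. How such phrase counts are normalized into an entropy estimate and a similarity score is defined in the paper itself and is not reproduced here.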

Sven Helmer
