Fast detection of XML structural similarity

Because semistructured data in XML format is now so widespread, much research effort is devoted to supporting the storage and retrieval of large collections of such documents. XML documents can be compared by their structural similarity and grouped into clusters, so that different storage, retrieval, and processing techniques can be exploited effectively. In this scenario, an efficient and effective similarity function is the key to a successful data management process. We present an approach for detecting structural similarity between XML documents that differs markedly from standard methods based on graph-matching algorithms and substantially reduces the required computation costs. In outline, our proposal linearizes the structure of each XML document by representing it as a numerical sequence, and then compares such sequences through the analysis of their frequencies. First, some basic strategies for encoding a document are proposed, each of which can focus on different structural facets. Then, the theory of the discrete Fourier transform is exploited to compare the encoded documents (i.e., signals) effectively and efficiently in the frequency domain. Experimental results demonstrate the effectiveness of the approach, also in comparison with standard methods.
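The pipeline outlined in the abstract (linearize, transform, compare) can be illustrated with a minimal Python sketch. It is not the paper's encoding or distance: the function names, the tag encoding (a positive integer per start tag, its negation per end tag), the zero-padding to a common length, and the Euclidean distance on DFT magnitudes are all assumptions chosen for illustration. A naive O(n^2) DFT is used to keep the example dependency-free; a real implementation would use an FFT library.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def encode_structure(xml_text, tag_ids):
    """Linearize a document into a numerical sequence (hypothetical encoding).

    Each start tag emits a positive integer id, each end tag its negation.
    `tag_ids` is shared across documents so equal tags get equal ids.
    """
    root = ET.fromstring(xml_text)
    signal = []

    def visit(node):
        code = tag_ids.setdefault(node.tag, len(tag_ids) + 1)
        signal.append(code)       # start tag
        for child in node:
            visit(child)
        signal.append(-code)      # end tag

    visit(root)
    return signal


def dft_magnitudes(signal, n):
    """Magnitudes of the n-point DFT of `signal`, zero-padded to length n."""
    padded = signal + [0] * (n - len(signal))
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(padded)))
        for k in range(n)
    ]


def structural_distance(xml_a, xml_b):
    """Euclidean distance between DFT magnitude spectra (assumed measure)."""
    tag_ids = {}
    a = encode_structure(xml_a, tag_ids)
    b = encode_structure(xml_b, tag_ids)
    n = max(len(a), len(b))
    mag_a = dft_magnitudes(a, n)
    mag_b = dft_magnitudes(b, n)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(mag_a, mag_b)))


flat = "<a><b/><b/></a>"
nested = "<a><b><c/></b></a>"
d_same = structural_distance(flat, flat)
d_diff = structural_distance(flat, nested)
```

Under these assumptions, two structurally identical documents have distance zero, while the flat and nested documents above yield a positive distance. Only element structure is encoded here; text content and attributes are ignored.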
