An Effective Data Processing Method for Fast Clustering

With the spread of Internet usage, heterogeneous computing platforms, and ubiquitous computing technologies, the volume of Web data, which are usually written in XML, has grown explosively. As Web data grow and clustering them becomes more important, a similarity detection method is needed, because it is a fundamental technology for efficient document management. In this paper, we introduce a similarity detection method that checks both the semantic similarity and the structural similarity between XML DTDs. For semantic checking, we adopt ontology technology; for structural checking, we apply the longest common string and longest nesting common string methods. Our method compares multi-tag sequences instead of traversing XML schema trees, so it produces reasonable results quickly.
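To make the structural check concrete, the following is a minimal sketch rather than the method of the paper. The abstract does not define "longest common string", so the sketch assumes it means the longest contiguous run of tags shared by two tag sequences. The function names, the sample tag sequences, and the normalization by the longer sequence length are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def longest_common_run(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return the longest contiguous run of tags present in both sequences.

    Assumption: this stands in for the "longest common string" step;
    the paper's exact definition is not given in the abstract.
    """
    best_len = 0
    best_end = 0  # index in `a` one past the end of the best run
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous pair of tags
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return list(a[best_end - best_len:best_end])


def structural_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Hypothetical score: shared run length over the longer sequence length."""
    if not a or not b:
        return 0.0
    return len(longest_common_run(a, b)) / max(len(a), len(b))


# Hypothetical tag sequences flattened from two DTDs
dtd_a = ["book", "title", "author", "name", "price"]
dtd_b = ["article", "title", "author", "name", "year"]

print(longest_common_run(dtd_a, dtd_b))
print(structural_similarity(dtd_a, dtd_b))
```

Working on flat tag sequences keeps the comparison at O(mn) time for sequences of lengths m and n, with no tree traversal. A semantic step, such as mapping synonymous tags to a common ontology term before comparison, would sit in front of this function and is not shown here.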
