Sequential Pattern Mining for Structure-Based XML Document Classification

This article presents an original supervised classification technique for XML documents that relies on structure only. Each XML document is viewed as an ordered labeled tree, represented by its tags only. Our method has three steps. After a cleaning step, we characterize each predefined cluster in terms of frequent structural subsequences. Then we classify the XML documents based on the mined patterns of each cluster.
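As a rough illustration of the last two steps, the sketch below flattens each document into its tag sequence, mines the subsequences that are frequent within each predefined cluster, and assigns a new document to the cluster whose patterns it matches best. It is a simplified sketch, not the article's algorithm: the brute-force enumeration stands in for a real sequential pattern miner, the cleaning step is omitted because the abstract does not say what it removes, and the function names, the `min_support` and `max_len` parameters, the scoring rule, and the sample documents are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations


def tag_sequence(xml_text):
    """Flatten a document into its tags, in document (pre-order) order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


def subsequences(seq, max_len):
    """Distinct, possibly non-contiguous subsequences of length 1..max_len."""
    found = set()
    for k in range(1, max_len + 1):
        for idx in combinations(range(len(seq)), k):
            found.add(tuple(seq[i] for i in idx))
    return found


def mine_patterns(sequences, min_support=1.0, max_len=3):
    """Keep subsequences present in at least min_support of the sequences.

    Brute force, fine for toy data only (assumption: a real miner
    would replace this function).
    """
    counts = Counter()
    for seq in sequences:
        counts.update(subsequences(seq, max_len))
    threshold = min_support * len(sequences)
    return {p for p, c in counts.items() if c >= threshold}


def is_subsequence(pattern, seq):
    """True if the pattern's tags occur in seq in the same order."""
    remaining = iter(seq)
    return all(tag in remaining for tag in pattern)


def classify(seq, patterns_by_class):
    """Pick the class whose mined patterns match seq in the largest share.

    The scoring rule is a hypothetical choice, not taken from the article.
    """
    def score(patterns):
        if not patterns:
            return 0.0
        return sum(is_subsequence(p, seq) for p in patterns) / len(patterns)

    return max(patterns_by_class, key=lambda c: score(patterns_by_class[c]))


# Invented training documents, two per predefined cluster.
training = {
    "book": [
        "<book><title/><author/><chapter/></book>",
        "<book><title/><chapter/><chapter/></book>",
    ],
    "article": [
        "<article><title/><abstract/><section/></article>",
        "<article><title/><author/><abstract/></article>",
    ],
}
patterns_by_class = {
    label: mine_patterns([tag_sequence(d) for d in docs])
    for label, docs in training.items()
}
new_doc = "<book><title/><author/><chapter/><chapter/></book>"
print(classify(tag_sequence(new_doc), patterns_by_class))
```

With this toy data the new document matches every pattern mined for the book cluster and only one pattern of the article cluster, so it is assigned to the book cluster. Enumerating all subsequences grows combinatorially with document size, which is why a dedicated sequential pattern mining algorithm would be needed on real collections.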
