Mining frequent patterns from XML data

The Web is rich with information, but its data is poorly organized, which makes extracting useful information from it difficult. The success of the Extensible Markup Language (XML) as a standard for representing semistructured data and for exchanging data over the Web makes Web data more readable and makes mining useful information from it feasible. As a result, mining XML data from the Web is becoming increasingly important. Previous studies adopt an Apriori-like candidate set generation approach, but candidate set generation remains costly. We propose that association rules can be extracted from XML documents with the XML query language XQuery, without any preprocessing or postprocessing, and we analyze an XQuery implementation of FP-growth, an efficient FP-tree-based method that mines the complete set of frequent patterns by pattern fragment growth. FP-tree-based mining uses pattern fragment growth to avoid the costly generation of a large number of candidate sets, together with a partition-based, divide-and-conquer strategy. Divide-and-conquer divides the problem into a number of subproblems and conquers them by solving them recursively; when a subproblem is small enough, it is solved directly. The solutions to the subproblems are then combined into the solution to the original problem. In addition, we suggest features that should be added to XQuery to make the implementation of FP-growth more efficient.
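
To make the pattern fragment growth and divide-and-conquer ideas above concrete, the following is a minimal sketch of FP-growth. It is written in Python rather than XQuery and is not the implementation analyzed in this work; the names Node, build_tree and fp_growth, and the representation of transactions as (items, count) pairs, are assumptions made for illustration only.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a link to its parent, and a count."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(transactions, min_support):
    """Build an FP-tree from (items, count) pairs.

    Returns (header, freq): header maps each frequent item to its tree
    nodes, freq maps each frequent item to its support.
    """
    freq = defaultdict(int)
    for items, count in transactions:
        for item in set(items):
            freq[item] += count
    freq = {i: c for i, c in freq.items() if c >= min_support}

    root = Node(None, None)
    header = defaultdict(list)
    for items, count in transactions:
        # Keep frequent items only, most frequent first (ties by name),
        # so that transactions sharing items share tree prefixes.
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, freq


def fp_growth(transactions, min_support, suffix=frozenset()):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    header, freq = build_tree(transactions, min_support)
    patterns = {}
    for item, nodes in header.items():
        # Grow the current pattern fragment by one frequent item.
        pattern = suffix | {item}
        patterns[pattern] = freq[item]
        # Divide: collect the prefix paths of this item (its conditional
        # pattern base); this is a smaller subproblem.
        conditional = []
        for node in nodes:
            path = []
            parent = node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        # Conquer: mine the subproblem recursively, then combine results.
        patterns.update(fp_growth(conditional, min_support, pattern))
    return patterns


baskets = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["a", "c"]]
result = fp_growth([(basket, 1) for basket in baskets], min_support=2)
```

With the four example baskets and a minimum support of 2, result holds the three single items (support 3 each) and the three pairs (support 2 each); the triple {a, b, c} occurs only once and is not reported. No candidate itemsets are generated and tested against the data: each recursive call works only on the prefix paths of one item.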