A stepwise learning approach to automatic discovery of interest data blocks

The proliferation of online information sources has led to increased use of wrappers for extracting data from Web sources. A key problem with existing wrappers is that the wrapper rules learned from examples apply only to the specific Web site they were learned from. We propose a novel approach, DBFinder, to discover data blocks of interest in a set of Web pages, which is a key step in data extraction. DBFinder works in two phases: semi-supervised wrapping and unsupervised wrapping. The goal of the first phase is to learn the wrapper rules for a specific sample Web site. The goal of the second phase is to generalize those rules to other Web sites in the same domain as the sample Web site. Two data mining techniques, frequent sub-tree mining and association rule mining, are used to this end. To demonstrate the feasibility of our approach, we conducted detailed experiments. We have also applied the approach in a real application, a comparison-shopping agent.